Chapter 6. Best Practices
This section contains recommendations and best practices for using Spark with HDP 2.3.
Using SQLContext and HiveContext
There are two ways to create a context in Spark SQL:
- The SQLContext class is the entry point into all Spark SQL functionality.
- The HiveContext class inherits from SQLContext and implements a superset of the functionality provided by SQLContext. Additional features include the ability to write queries using HiveQL and the ability to read data from Hive tables.
Recommendation: use HiveContext (instead of SQLContext) whenever possible.
Note: In yarn-client mode on a secure cluster you can use HiveContext to access the Hive Metastore. HiveContext is not supported for yarn-cluster mode on a secure cluster.
Examples
The following functions work with both HiveContext and SQLContext:
    Avg()
    Sum()
The following functions work only with HiveContext:
    variance(col)
    var_pop(col)
    stddev_pop(col)
    stddev_samp(col)
    covar_samp(col1, col2)
For more information, see the Spark Programming Guide.
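To make the naming of the HiveContext-only functions concrete, the following sketch reproduces the arithmetic behind four of them in plain Python. It is not Spark code and does not use HiveContext; it assumes the usual SQL convention that a `_pop` suffix means a population statistic (divide by n) and a `_samp` suffix means a sample statistic (divide by n - 1). The data in `col1` and `col2` and the helper `covar_samp` are made up for illustration.

```python
import statistics


def covar_samp(xs, ys):
    """Sample covariance: sum of co-deviations divided by (n - 1)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Hypothetical column values, chosen so the population variance is a round number.
col1 = [2, 4, 4, 4, 5, 5, 7, 9]
col2 = [1, 2, 3, 4, 5, 6, 7, 8]

var_pop = statistics.pvariance(col1)   # population variance: divides by n
stddev_pop = statistics.pstdev(col1)   # square root of the population variance
stddev_samp = statistics.stdev(col1)   # sample standard deviation: divides by (n - 1)

if __name__ == "__main__":
    print("var_pop    ", var_pop)
    print("stddev_pop ", stddev_pop)
    print("stddev_samp", round(stddev_samp, 4))
    print("covar_samp ", round(covar_samp(col1, col2), 4))
```

The sample statistics are always slightly larger than their population counterparts on the same data, which is worth keeping in mind when choosing between `stddev_pop` and `stddev_samp` in a query.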
Guidelines for Determining Spark Memory Allocation
This section describes how to determine memory allocation for a JVM running the Spark executor.
To avoid memory issues, Spark uses 90% of the JVM heap by default. This percentage is controlled by spark.storage.safetyFraction.
Of this 90% of JVM allocation, Spark reserves memory for three purposes:
- Storing in-memory shuffle: 20% by default (controlled by spark.shuffle.memoryFraction)
- Unroll, used to serialize and deserialize Spark objects to disk when they don’t fit in memory: 20% by default (controlled by spark.storage.unrollFraction)
- Storing RDDs: 60% by default (controlled by spark.storage.memoryFraction)
Example
If the JVM heap is 4GB, the total memory available for RDD storage is calculated as:
4 GB x 0.9 x 0.6 = 2.16 GB
Therefore, with the default configuration approximately one half of the Executor JVM heap is used for storing RDDs.
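The same calculation can be repeated for other heap sizes with a small helper. The following is a minimal sketch in Python, not part of Spark: the function name `rdd_storage_gb` is hypothetical, and its default arguments simply mirror the default values quoted above for spark.storage.safetyFraction and spark.storage.memoryFraction.

```python
def rdd_storage_gb(heap_gb, safety_fraction=0.9, memory_fraction=0.6):
    """Estimate the heap available for RDD storage, in GB.

    heap_gb          -- executor JVM heap size in GB
    safety_fraction  -- share of the heap Spark uses (spark.storage.safetyFraction)
    memory_fraction  -- share of that reserved for RDDs (spark.storage.memoryFraction)
    """
    return heap_gb * safety_fraction * memory_fraction


if __name__ == "__main__":
    # Print the estimate for a few illustrative heap sizes.
    for heap in (2, 4, 8):
        print(f"{heap} GB heap -> {rdd_storage_gb(heap):.2f} GB for RDD storage")
```

If either fraction has been changed in the cluster configuration, pass the configured value instead of relying on the defaults.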
For additional information about Spark memory use, see the Apache Spark Hardware Provisioning recommendations.
Configuring YARN Memory Allocation for Spark
This section describes how to manually configure YARN memory allocation settings based on node hardware specifications.
YARN takes into account all of the available compute resources on each machine in the cluster, and negotiates resource requests from applications running in the cluster. YARN then provides processing capacity to each application by allocating containers. A container is the basic unit of processing capacity in YARN; it is an encapsulation of resource elements such as memory (RAM) and CPU.
In a Hadoop cluster, it is important to balance the usage of RAM, CPU cores, and disks so that processing is not constrained by any one of these cluster resources.
When determining the appropriate YARN memory configurations for Spark, note the following values on each node:
- RAM (Amount of memory) 
- CORES (Number of CPU cores) 
Configuring Spark for yarn-cluster Deployment Mode
In yarn-cluster mode, the Spark driver runs inside an application master
      process that is managed by YARN on the cluster. The client can stop after initiating the
      application.
The following command starts a YARN client in yarn-cluster mode. The client starts the default Application Master, and SparkPi runs as a child thread of the Application Master. The client periodically polls the Application Master for status updates, which are displayed in the console. The client exits when the application stops running.
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn-cluster \
    --num-executors 3 \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    lib/spark-examples*.jar 10
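Before submitting a job, it can help to add up how much JVM heap the command requests. The sketch below is one way to do that in Python; the helpers `parse_size_gb` and `requested_heap_gb` are hypothetical names, not part of Spark. The figure is a lower bound only: it counts the heap sizes given on the command line and ignores any additional per-container memory that YARN allocates on top of them. In yarn-cluster mode the driver runs on the cluster, so its memory is included in the total.

```python
def parse_size_gb(size):
    """Convert a size string such as '4g' or '512m' to gigabytes."""
    units = {"g": 1.0, "m": 1.0 / 1024}
    return float(size[:-1]) * units[size[-1].lower()]


def requested_heap_gb(num_executors, executor_memory, driver_memory):
    """Sum the JVM heap requested for all executors plus the driver, in GB."""
    return num_executors * parse_size_gb(executor_memory) + parse_size_gb(driver_memory)


if __name__ == "__main__":
    # Values taken from the spark-submit example above:
    # --num-executors 3 --executor-memory 2g --driver-memory 4g
    total = requested_heap_gb(3, "2g", "4g")
    print(f"Requested heap: {total:.1f} GB")
```

Only the `g` and `m` suffixes are handled here; other suffixes would raise a `KeyError`, which is acceptable for a quick estimate but not for production tooling.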
Configuring Spark for yarn-client Deployment Mode
In yarn-client mode, the driver runs in the client process. The application master is used only to request resources from YARN.
To launch a Spark application in yarn-client mode, replace
      yarn-cluster with yarn-client. For example:
./bin/spark-shell --num-executors 32 \
    --executor-memory 24g \
    --master yarn-client
Considerations
When configuring Spark on YARN, consider the following information:
- Executor processes are not released until the job finishes, even if they are no longer in use. Therefore, do not allocate more executors than your estimated requirements.
- Driver memory does not need to be large if the job does not aggregate much data on the driver (as a collect() action does).
- There are tradeoffs between num-executors and executor-memory. Large executor memory does not imply better performance, due to JVM garbage collection. Sometimes it is better to configure a larger number of small JVMs than a small number of large JVMs.
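To explore the tradeoff between num-executors and executor-memory, it can be useful to list the alternative ways of dividing the same total memory. The following Python sketch does only that arithmetic; the function `executor_layouts` is a hypothetical helper, the candidate sizes are arbitrary, and the sketch ignores per-node limits on RAM and cores, which also constrain what YARN can actually schedule.

```python
def executor_layouts(total_memory_gb, executor_sizes_gb):
    """Return (num_executors, executor_memory_gb) pairs that fit the memory budget."""
    layouts = []
    for size in executor_sizes_gb:
        count = total_memory_gb // size  # whole executors that fit in the budget
        if count > 0:
            layouts.append((count, size))
    return layouts


if __name__ == "__main__":
    # 768 GB matches the 32 x 24g request in the yarn-client example above.
    for count, size in executor_layouts(768, [24, 12, 6]):
        print(f"--num-executors {count} --executor-memory {size}g")
```

Each printed line requests the same total heap; which layout performs best depends on the workload and on garbage collection behavior, so the candidates should be measured rather than assumed.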

