Using Spark Streaming with Kafka on a Kerberos-Enabled Cluster
This section describes specific steps for developers using Spark Streaming with Kafka on a Kerberos-enabled cluster.
Adding the spark-streaming-kafka jar File to a Deployed Build
Before running Spark Streaming jobs with Kafka in a Kerberos environment, you will need to add or retrieve the HDP spark-streaming-kafka jar file and its associated jar files.
> **Note**
>
> The spark-streaming-kafka jar file is required for running a job that is not a Spark example job provided with HDP. If you are running a job that is part of the Spark examples package installed by HDP, you do not need the spark-streaming-kafka jar.
Instructions for Developing and Building Applications
If you are using Maven as a build tool:

1. Add the Hortonworks repository to your `pom.xml` file:

   ```xml
   <repository>
       <id>hortonworks</id>
       <name>hortonworks repo</name>
       <url>http://repo.hortonworks.com/content/repositories/releases/</url>
   </repository>
   ```

2. Specify the Hortonworks version number for the Spark Streaming Kafka and Spark Streaming dependencies in your `pom.xml` file:

   ```xml
   <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-streaming-kafka_2.10</artifactId>
       <version>1.6.2.2.4.2.0-90</version>
   </dependency>
   <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-streaming_2.10</artifactId>
       <version>1.6.2.2.4.2.0-90</version>
       <scope>provided</scope>
   </dependency>
   ```

   Note that the correct version number combines the Spark version and the HDP version.
3. (Optional) The default scope of the spark-streaming jar is `provided`, which means that the jar is provided by the environment and is not packed into an uber jar. (An uber jar packages all dependencies into one jar.) If you prefer to pack an uber jar, add the `maven-shade-plugin` to your `pom.xml` file:

   ```xml
   <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-shade-plugin</artifactId>
       <version>2.3</version>
       <executions>
           <execution>
               <phase>package</phase>
               <goals>
                   <goal>shade</goal>
               </goals>
           </execution>
       </executions>
       <configuration>
           <filters>
               <filter>
                   <artifact>*:*</artifact>
                   <excludes>
                       <exclude>META-INF/*.SF</exclude>
                       <exclude>META-INF/*.DSA</exclude>
                       <exclude>META-INF/*.RSA</exclude>
                   </excludes>
               </filter>
           </filters>
           <finalName>uber-${project.artifactId}-${project.version}</finalName>
       </configuration>
   </plugin>
   ```
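   With the shade plugin in place, building and inspecting the uber jar might look like the following sketch. It assumes Maven and a JDK are on your path; the jar name depends on your project's `artifactId` and `version`, per the `finalName` pattern.

   ```shell
   # Build the project; the shade plugin runs during the package phase.
   mvn clean package

   # The uber jar is written to target/ as uber-<artifactId>-<version>.jar.
   # Optionally confirm that the Kafka streaming classes were packed in:
   jar tf target/uber-*.jar | grep streaming/kafka
   ```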
Instructions for Submitting your Spark Streaming Job
Instructions for submitting your job depend on whether you used an uber jar or not:
- If you kept the default jar scope and you can access an external network, use `--packages` to download dependencies into the runtime library:

  ```shell
  spark-submit --master yarn-client --num-executors 1 \
      --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2.2.4.2.0-90 \
      --repositories http://repo.hortonworks.com/content/repositories/releases/ \
      --class <user-main-class> \
      <user-application.jar> \
      <user arg lists>
  ```
  The artifact and repository locations should be the same as those specified in your `pom.xml` file.

- If you packed the jar into an uber jar, submit the jar as you would a regular Spark application:

  ```shell
  spark-submit --master yarn-client --num-executors 1 \
      --class <user-main-class> \
      <user-uber-application.jar> \
      <user arg lists>
  ```
For a sample pom.xml file, see Sample pom.xml file for Spark Streaming with Kafka.
Running Spark Streaming - Kafka Jobs on a Kerberos-Enabled Cluster
The following instructions assume that Spark and Kafka are already deployed on a Kerberos-enabled cluster.
1. Select or create a user account to be used as the principal. This should not be the `kafka` or `spark` service account.

2. Generate a keytab for the user.
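   One way to generate the keytab is sketched below with MIT Kerberos admin tools. The admin principal, user principal, and file names are examples only; substitute the values for your realm.

   ```shell
   # Create the principal (if needed) and export its key to a keytab.
   # Run as a Kerberos administrator. Note that ktadd re-randomizes the
   # principal's key, invalidating any existing password.
   kadmin -p admin/admin@EXAMPLE.COM -q "addprinc -randkey vagrant@EXAMPLE.COM"
   kadmin -p admin/admin@EXAMPLE.COM -q "ktadd -k v.keytab vagrant@EXAMPLE.COM"

   # Verify the keytab contents:
   klist -kt v.keytab
   ```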
3. Create a JAAS configuration file (for example, `key.conf`), and add configuration settings to specify the user keytab.

   The following example specifies keytab location `./v.keytab` for user `vagrant@EXAMPLE.COM`.

   > **Note**
   >
   > The keytab and configuration files will be distributed using YARN local resources. They end up in the current directory of the Spark YARN container, so the location should be specified as `./v.keytab`.

   ```
   KafkaClient {
       com.sun.security.auth.module.Krb5LoginModule required
       useKeyTab=true
       keyTab="./v.keytab"
       storeKey=true
       useTicketCache=false
       serviceName="kafka"
       principal="vagrant@EXAMPLE.COM";
   };
   ```

4. In your job submission instructions, pass the JAAS configuration file and keytab as local resource files. Add the JAAS configuration file options to the JVM options specified for the driver and executor:
   > **Note**
   >
   > If you are running a job that is part of the Spark `examples` package installed by HDP, you do not need to add the spark-streaming-kafka jar. Otherwise, add the spark-streaming-kafka jar using the `--jars` command-line option.

   ```
   --files key.conf#key.conf,v.keytab#v.keytab
   --driver-java-options "-Djava.security.auth.login.config=./key.conf"
   --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf"
   ```
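   Assembled from the options shown in this section, a complete submission command might look like the following; the class name, application jar, and argument list are placeholders for your own values.

   ```shell
   spark-submit --master yarn-client --num-executors 1 \
       --files key.conf#key.conf,v.keytab#v.keytab \
       --driver-java-options "-Djava.security.auth.login.config=./key.conf" \
       --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf" \
       --packages org.apache.spark:spark-streaming-kafka_2.10:1.6.2.2.4.2.0-90 \
       --repositories http://repo.hortonworks.com/content/repositories/releases/ \
       --class <user-main-class> \
       <user-application.jar> \
       <user arg lists>
   ```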
5. Pass any relevant Kafka security options to your streaming application. For example, the KafkaWordCount example accepts PLAINTEXTSASL as the last option on the command line:

   ```shell
   KafkaWordCount /vagrant/spark-examples.jar c6402:2181 abc ts 1 PLAINTEXTSASL
   ```

