I'm attempting to submit a PySpark job via the Dataproc UI and keep getting an error; it looks like the Kafka streaming package is not being loaded.
Here is the REST command the UI shows for my job:
POST /v1/projects/projectname/regions/global/jobs:submit/
{
  "projectId": "projectname",
  "job": {
    "placement": {
      "clusterName": "cluster-main"
    },
    "reference": {
      "jobId": "job-33ab811a"
    },
    "pysparkJob": {
      "mainPythonFileUri": "gs://projectname/streaming.py",
      "args": [
        "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0"
      ],
      "jarFileUris": [
        "gs://projectname/spark-streaming-kafka-0-10_2.11-2.2.0.jar"
      ]
    }
  }
}
I have tried passing the Kafka package both ways: as an argument and as a jar file.
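From the Spark docs, --packages maps to the spark.jars.packages property rather than a program argument, so I'm wondering whether it should go in the job's properties map instead of args. Something like this is what I have in mind (just a guess on my part, using the same package coordinates as above; I haven't confirmed this actually pulls the package on Dataproc):
"pysparkJob": {
  "mainPythonFileUri": "gs://projectname/streaming.py",
  "properties": {
    "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0"
  }
}
I'm also not sure the jar I'm passing is the right artifact, since spark-streaming-kafka-0-10 looks like the DStream library, while readStream appears to need spark-sql-kafka-0-10.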
Here is my code (streaming.py):
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc = SparkContext()
spark = SparkSession.builder.master("local").appName("Spark-Kafka-Integration").getOrCreate()
# < ip > is masked
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "<ip>:9092") \
.option("subscribe", "rsvps") \
.option("startingOffsets", "earliest") \
.load()
df.printSchema()
The error: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
Full trace: https://pastebin.com/Uz3iGy2N