I'm attempting to submit a PySpark job via the Dataproc UI and keep getting an error; it looks like the Kafka streaming package is not being loaded.
Here is the REST command the UI shows for my job:
POST /v1/projects/projectname/regions/global/jobs:submit/
{
  "projectId": "projectname",
  "job": {
    "placement": {
      "clusterName": "cluster-main"
    },
    "reference": {
      "jobId": "job-33ab811a"
    },
    "pysparkJob": {
      "mainPythonFileUri": "gs://projectname/streaming.py",
      "args": [
        "--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0"
      ],
      "jarFileUris": [
        "gs://projectname/spark-streaming-kafka-0-10_2.11-2.2.0.jar"
      ]
    }
  }
}
I have tried passing the Kafka package both ways: as an argument and as a jar file.
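From the Spark docs, --packages maps to the spark.jars.packages property rather than a program argument, so I'm wondering whether it should go in the job's properties map instead of args. Something like this is what I have in mind (just a guess on my part, using the same package coordinates as above; I haven't confirmed this actually pulls the package on Dataproc):
"pysparkJob": {
  "mainPythonFileUri": "gs://projectname/streaming.py",
  "properties": {
    "spark.jars.packages": "org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0"
  }
}
I'm also not sure the jar I'm passing is the right artifact, since spark-streaming-kafka-0-10 looks like the DStream library, while readStream appears to need spark-sql-kafka-0-10.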
Here is my code (streaming.py):
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc = SparkContext()
spark = SparkSession.builder.master("local").appName("Spark-Kafka-Integration").getOrCreate()
# < ip > is masked
df = spark \
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "<ip>:9092") \
.option("subscribe", "rsvps") \
.option("startingOffsets", "earliest") \
.load()
df.printSchema()
The error: java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
Full trace: https://pastebin.com/Uz3iGy2N