I am trying to run the following PySpark/Kafka streaming example in a Jupyter notebook. This is the first part of the code in my notebook:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

# Create the Spark context and a streaming context with a 2-second batch interval
sc = SparkContext(master='local[*]', appName="PySpark streaming")
ssc = StreamingContext(sc, 2)

# Connect a direct stream to the Kafka topic
topic = "my-topic"
brokers = "localhost:9092"
kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
When I run the cell, I receive the following error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.3.0 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.3.0.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
My questions are: how can I pass the --packages or --jars argument to a Jupyter notebook? Alternatively, can I download this package once and link it permanently to Python/Jupyter (perhaps via an export in .bashrc)?
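One approach I have come across (I have not verified that the package coordinates below match my Spark/Scala build, so treat them as an assumption) is to set the PYSPARK_SUBMIT_ARGS environment variable from inside the notebook, before the SparkContext is created, which should have the same effect as passing --packages to spark-submit:

import os

# Assumption: Spark 2.3.0 built against Scala 2.11; adjust the coordinates
# to match the local installation. The trailing 'pyspark-shell' is required
# so that PySpark still launches its gateway.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.3.0 '
    'pyspark-shell'
)

from pyspark import SparkContext

sc = SparkContext(master='local[*]', appName="PySpark streaming")

If that is the right mechanism, I assume the "permanent" variant would be an export of PYSPARK_SUBMIT_ARGS with the same value in .bashrc instead of setting it in the notebook, but I would like to know whether this is the recommended way.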