I am using a Jupyter notebook with PySpark, based on the following Docker image: Jupyter all-spark-notebook.
Now I would like to write a PySpark streaming application which consumes messages from Kafka. The Spark-Kafka Integration guide describes how to deploy such an application using spark-submit (it requires linking an external jar; the explanation is in section 3, Deploying). But since I am using a Jupyter notebook, I never actually run the spark-submit command myself; I assume it gets run in the background when I press execute.
The spark-submit command lets you specify some parameters, one of them being --jars, but it is not clear to me how I can set this parameter from the notebook (or externally, via environment variables?). I am assuming I can link this external jar dynamically via the SparkConf or the SparkContext object, roughly as sketched below. Does anyone have experience with how to perform the linking properly from the notebook?