
What is the proper way to include external packages (jars) in a pyspark shell?

I am using pyspark from a jupyter notebook.

I would like to read from kafka using spark, via the spark-sql-kafka library, as explained here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying.

I am trying to import the library via the --packages option, set in the environment variable PYSPARK_SUBMIT_ARGS.

But

  • I am not sure about the exact version and name of the package to use,
  • I don't know whether I also need to include spark-streaming or not, whether I have to specify some repository with --repositories or not,
  • I don't know whether it's better to download the jar and specify local paths (do they have to be on the machine where jupyter is running, or on the machine where yarn is running? I'm using --master yarn and --deploy-mode client) or to rely on --packages
  • I don't know whether options specified after pyspark-shell in PYSPARK_SUBMIT_ARGS are left out or not (if I try to specify --packages options before pyspark-shell, I can't instantiate the Spark context at all)
  • How can I check whether some package was correctly downloaded and is available to be used?
  • I don't know what is the route that such downloaded jars (or jars in general) take. How many times are they replicated? Do they pass through the driver? Do these things change if I'm using a cluster manager as YARN? Do they change if I'm using everything from a spark-shell in a jupyter notebook?
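
Concretely, the kind of setting I mean looks roughly like this (the package coordinate is just a placeholder):

import os

# Set before the first SparkContext / SparkSession is created in the notebook.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--master yarn --deploy-mode client '
    '--packages <groupId>:<artifactId>:<version> '
    'pyspark-shell'
)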

Resources I read so far:

2 Answers


For simplicity, I would try to get things working outside of Jupyter first.


the exact version and name of the package to use

The version needs to match both your Spark version and the Scala version your Spark build uses. For Structured Streaming, the artifact is org.apache.spark:spark-sql-kafka-0-10_&lt;scala version&gt;:&lt;spark version&gt;, as described in the page you linked.
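
As a sketch, assuming a Scala 2.11 build of Spark (check your cluster's build), you can derive the coordinate from the pyspark version:

import pyspark

scala_version = "2.11"               # assumption: must match the Scala build of your cluster's Spark
spark_version = pyspark.__version__  # e.g. "2.4.5"; must match the cluster, not just the notebook
kafka_package = "org.apache.spark:spark-sql-kafka-0-10_{}:{}".format(scala_version, spark_version)
print(kafka_package)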

I don't know whether I also need to include spark-streaming or not

No, you don't. Spark Streaming is already on the classpath of your Spark workers.

whether I have to specify some repository with --repositories or not,

If you are able to download files directly from Maven Central, then no.
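
If you are behind a firewall and have to go through an internal mirror, the flag goes alongside --packages. A sketch, where the repository URL is a hypothetical placeholder and the coordinate is an example for Spark 2.4.5 / Scala 2.11:

import os

# Only needed when Maven Central is not reachable; the mirror URL is hypothetical.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--repositories https://nexus.example.internal/repository/maven-public '
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 '
    'pyspark-shell'
)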

whether it's better to download the jar and specify local paths

You should probably use --packages, which will download the files (and their transitive dependencies) for you. The deploy mode and cluster manager don't change this.
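
If you do go the local-jar route instead, the jars need to be readable on the machine where the driver starts (your Jupyter host, in client mode), and Spark ships them to the executors. A sketch with hypothetical paths:

import os

# The jar paths are placeholders; note that spark-sql-kafka also needs the
# kafka-clients jar, which is one reason --packages is usually easier.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars /path/to/spark-sql-kafka.jar,/path/to/kafka-clients.jar '
    'pyspark-shell'
)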

whether options specified after pyspark-shell in PYSPARK_SUBMIT_ARGS are left out or not

Shouldn't be, though I typically see pyspark-shell as the last option.

import os

# Fill in the --packages coordinates before pyspark-shell; pyspark-shell stays last.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages ... pyspark-shell'

import pyspark
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils  # DStream Kafka API (Spark 2.x)

sc = pyspark.SparkContext()
ssc = StreamingContext(sc, 1)  # 1-second batch interval

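Since your question is about the Structured Streaming Kafka source rather than the DStream API above, a minimal sketch of that route (coordinate, broker address, and topic name are placeholders to adjust):

import os

# Example coordinate for a Scala 2.11 / Spark 2.4.5 build; adjust to your cluster.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5 pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# localhost:9092 and "my_topic" are placeholders for your broker and topic.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "my_topic")
      .load())

Note that PYSPARK_SUBMIT_ARGS has to be set before the first SparkSession/SparkContext is created in the notebook; if a session already exists, getOrCreate() reuses it and the new packages are not picked up.
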
How can I check whether some package was correctly downloaded

If it wasn't downloaded, you would get a NoClassDefFoundError (or a similar class-not-found error) at run time.
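
One quick check from the notebook, assuming the Structured Streaming source: ask for the kafka data source and see whether Spark can find the class. Broker and topic here are placeholders:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# If the jar was not resolved, this raises an error mentioning the kafka data source
# (e.g. "Failed to find data source: kafka"); no query is started, so this is a
# cheap classpath check rather than a connectivity test.
try:
    (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "my_topic")                      # placeholder topic
          .load())
    print("Kafka source found on the classpath")
except Exception as err:
    print("Kafka source not available:", err)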

what is the route that such downloaded jars (or jars in general) take

There's $SPARK_HOME/jars for Spark's own jars. Anything passed with --jars or --packages is resolved on the machine that submits the job (--packages downloads are cached in the local Ivy cache, ~/.ivy2, for the user running the job), then shipped to the nodes via the YARN distributed cache and symlinked into the running YARN container / Spark executor's working directory.

OneCricketeer
  • Hi, I'm trying to use the Spark Cassandra Connector with PySpark from Jupyter. When I run the snippet above with my packages added it runs fine (less than 1 second), but when I run my code it gives a no-class-found error. – Abdul Haseeb Dec 25 '20 at 08:07
  • It worked for me. I was using spark.session.getOrCreate(), so it was picking up the previously created Spark session and the new settings were not being applied. Thanks for the solution. – Abdul Haseeb Dec 25 '20 at 08:12

When you want to include external packages in the pyspark shell, you can pass them at launch time, the same way you would with spark-submit:

./bin/pyspark --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.2.0,com.databricks:spark-avro_2.11:3.2.0 \
  --conf spark.ui.port=4055 \
  --files /home/bdpda/spark_jaas,/home/bdpda/bdpda.headless.keytab \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/home/bdpda/spark_jaas" \
  --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=/home/bdpda/spark_jaas"

Note: this pyspark invocation was used for the same use case, connecting PySpark to Kafka Structured Streaming.