What is the proper way to include external packages (jars) in a pyspark shell?
I am using pyspark from a Jupyter notebook.
I would like to read from Kafka using Spark, via the spark-sql-kafka library, as explained here: https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#deploying.
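Concretely, this is roughly what I want to run in the notebook once the connector is available (the broker address and topic name are just placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-test").getOrCreate()

# Placeholder bootstrap servers and topic; without the connector jar this
# fails with an AnalysisException saying the "kafka" data source cannot be found.
df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "my_topic")
    .load()
)
df.printSchema()
```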
I am trying to import the library via the --packages option, set in the environment variable PYSPARK_SUBMIT_ARGS.
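For reference, this is roughly my current attempt, set before the SparkSession is created; the package coordinate and version here are just my guess and presumably have to match my Spark and Scala versions:

```python
import os

# Guessed Maven coordinate: spark-sql-kafka-0-10, Scala 2.12 build, Spark 3.1.2.
# As far as I understand, the trailing "pyspark-shell" token has to come last.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--master yarn --deploy-mode client "
    "--packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2 "
    "pyspark-shell"
)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
```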
But:
- I am not sure about the exact name and version of the package to use.
- I don't know whether I also need to include spark-streaming, and whether I have to specify some repository with --repositories or not.
- I don't know whether it's better to download the jar and specify local paths (do they have to be on the machine where the Jupyter notebook is running, or on the machine where YARN is running? I'm using --master yarn and --deploy-mode client), or to rely on --packages.
- I don't know whether options specified after pyspark-shell in PYSPARK_SUBMIT_ARGS are ignored or not (if I try to specify the --packages option before pyspark-shell, I can't instantiate the Spark context at all).
- How can I check whether a package was correctly downloaded and is available to be used? (See the sketch after this list.)
- I don't know what route such downloaded jars (or jars in general) take. How many times are they replicated? Do they pass through the driver? Does any of this change if I'm using a cluster manager such as YARN, or if I'm running everything from a pyspark shell inside a Jupyter notebook?
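Regarding the check mentioned above, the closest I've come up with is inspecting the configuration and simply trying the data source (broker and topic are placeholders again); I'm not sure this is the right way to verify it:

```python
# Show what was actually passed to spark-submit; returns None if the key was never set.
print(spark.sparkContext.getConf().get("spark.jars.packages"))
print(spark.sparkContext.getConf().get("spark.jars"))

# The practical test: this raises an AnalysisException about the "kafka" data
# source not being found if the connector jar is not on the classpath.
stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "my_topic")
    .load()
)
```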
Resources I read so far:
Docs and guides:
Examples:
Issues and questions:
Repositories: