To get the notebook working, you'll want the notebook setup itself to pick up the right packages. Since the initialization action you linked ensures Jupyter uses the cluster's configured Spark directories, and thus picks up all the necessary YARN/filesystem/lib configurations, the best way to do this is to add the property at cluster-creation time instead of at job-submission time:
gcloud dataproc clusters create <cluster-name> \
    --properties spark:spark.jars.packages=com.databricks:spark-csv_2.11:1.2.0
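If you want to double-check that the property actually made it into the cluster's Spark configuration, something along these lines should show it (the cluster name is a placeholder, and the --format projection is just one way to slice the describe output):

# Inspect the software properties recorded on the cluster (placeholder name):
gcloud dataproc clusters describe <cluster-name> \
    --format='yaml(config.softwareConfig.properties)'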
Per this StackOverflow answer, setting the spark-defaults.conf property spark.jars.packages is the more portable equivalent of specifying the --packages option, since --packages is just syntactic sugar in the spark-shell/spark-submit/pyspark wrappers that sets the spark.jars.packages configuration entry anyway.
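For reference, the two forms below are meant to be interchangeable; the conf-file path is where Dataproc typically keeps Spark defaults, and the package coordinates are just copied from above:

# Persisted in /etc/spark/conf/spark-defaults.conf (what the cluster property above ends up writing):
#   spark.jars.packages  com.databricks:spark-csv_2.11:1.2.0
# Or passed at launch time, where the wrapper flag just fills in that same config entry:
pyspark --packages com.databricks:spark-csv_2.11:1.2.0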