32

I'm launching a pyspark program:

$ export SPARK_HOME=
$ export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.9-src.zip
$ python

And the py code:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Example").setMaster("local[2]")
sc = SparkContext(conf=conf)

How do I add jar dependencies such as the Databricks csv jar? Using the command line, I can add the package like this:

$ pyspark/spark-submit --packages com.databricks:spark-csv_2.10:1.3.0 

But I'm not using either of these. The program is part of a larger workflow that does not use spark-submit, so I should be able to run my ./foo.py program and have it just work.

  • I know you can set the Spark extraClassPath properties, but don't you have to copy the JAR files to each node for that?
  • I also tried conf.set("spark.jars", "jar1,jar2"); that didn't work either and failed with a py4j ClassNotFoundException.
Nora Olsen

7 Answers

51

Updated 2021-01-19

There are many approaches here (setting ENV vars, adding to $SPARK_HOME/conf/spark-defaults.conf, etc.), and other answers already cover them. I wanted to add an answer for those who specifically want to do this from within a Python script or Jupyter notebook.

When you create the Spark session you can add a .config() that pulls in the specific JAR package (in my case I wanted the Kafka package loaded):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')\
    .getOrCreate()

Using this line of code I didn't need to do anything else (no ENVs or conf file changes).

  • Note 1: The JAR files will be downloaded dynamically; you don't need to download them manually.
  • Note 2: Make sure the versions match what you want. In the example above my Spark version is 3.0.1, so the coordinate ends with :3.0.1.
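
For context, once the session is created with that package, a Kafka read can reference the connector directly. This is a minimal sketch; the bootstrap server and topic name are hypothetical and not from the original answer:

df = (spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
      .option("subscribe", "my_topic")                      # hypothetical topic
      .load())
df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)").show()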
Brian Wylie
  • This option seems often ignored/undocumented elsewhere... as stated, this is a good solution for Jupyter users. – Luke W Nov 20 '17 at 23:04
  • for jars, use 'spark.jars' – Saksham Aug 16 '18 at 12:12
  • This answer is perfect for anyone who is launching a Spark environment from code in general and needs to pull a jar during runtime. I'm successfully using this to load a GraphFrames jar onto some very limited-access systems which provide no way to build a custom SparkConf file. Thanks for the clear example! – bsplosion Dec 27 '18 at 20:31
  • @briford-wylie But did you have to download and place a jar file somewhere? I ran `jar -tvf fileName.jar | grep -i kafka` for each jar in the Spark `.../jars/` directory and found nothing for kafka. Where was yours located? I'm not necessarily interested in kafka per se; I'm just following your example to try to generalize it. – NYCeyes Dec 27 '18 at 23:37
  • If you want to add multiple JAR packages, check this link: https://stackoverflow.com/questions/57862801/spark-shell-add-multiple-drivers-jars-to-classpath-using-spark-defaults-conf/65799134#65799134 – Pramod Kumar Sharma Jan 20 '21 at 18:14
  • I have to use this trick in PyCharm. I set spark-defaults.conf with all the packages, but PyCharm doesn't load them at execution. Does anyone know why this happens? – Marco Antonio Yamada May 05 '21 at 11:09
  • I have checked that this solution also works on Azure Databricks for adding the XML jar. – Rahul Pandey Jul 28 '21 at 21:55
  • Adding the GraphFrames jars isn't working this way because they moved from Maven Central to Bintray. Is it possible to add a custom repo for this lookup? – user1264641 Aug 31 '21 at 16:13
14

Any dependencies can be passed using the spark.jars.packages property (setting spark.jars should work as well) in $SPARK_HOME/conf/spark-defaults.conf. It should be a comma-separated list of coordinates.
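
For example, a single line in spark-defaults.conf would pull in the package from the question (a sketch reusing the spark-csv coordinate shown above):

spark.jars.packages com.databricks:spark-csv_2.10:1.3.0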

Packages and classpath properties have to be set before the JVM is started, and this happens during SparkConf initialization. That means the SparkConf.set method cannot be used here.

An alternative approach is to set the PYSPARK_SUBMIT_ARGS environment variable before the SparkConf object is initialized:

import os
from pyspark import SparkConf, SparkContext

SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS

conf = SparkConf()
sc = SparkContext(conf=conf)
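
Once the context is up, the spark-csv package loaded above can be used. A minimal sketch (the CSV file name is hypothetical, not from the original answer):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true") \
    .load("example.csv")  # hypothetical CSV file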
zero323
  • This solution seems not to work for me, at least within the notebook; I still get class-not-found errors. In fact, none of the environment variables I set seem to get picked up by Spark. It seems like `os.environ` sets the environment only for the process in which the python kernel is running, but any subprocesses don't pick up those environment variables. In other words, it's not doing the equivalent of `export ...`. Any thoughts? – santon Mar 04 '16 at 00:27
  • `subprocess.Popen` takes `env` argument where you can pass a copy of the current environment. – zero323 Mar 04 '16 at 00:34
  • This actually works, but I have no clue how it works behind the scenes: setting `PYSPARK_SUBMIT_ARGS` to an incomplete `SUBMIT_ARGS` without `spark-submit` automatically imports the packages when running the Spark job. Is there any documentation on how this works? – JohnWick Apr 09 '23 at 06:34
6

I encountered a similar issue for a different jar ("MongoDB Connector for Spark", mongo-spark-connector), but the big caveat was that I installed Spark via pyspark in conda (conda install pyspark). Therefore, all the assistance in the Spark-specific answers wasn't exactly helpful. For those of you installing with conda, here is the process that I cobbled together:

1) Find where your pyspark/jars are located. Mine were in this path: ~/anaconda2/pkgs/pyspark-2.3.0-py27_0/lib/python2.7/site-packages/pyspark/jars. (One way to find this directory programmatically is sketched after step 3 below.)

2) Download the jar file into the path found in step 1, from this location.

3) Now you should be able to run something like this (code taken from MongoDB official tutorial, using Briford Wylie's answer above):

from pyspark.sql import SparkSession

my_spark = SparkSession \
    .builder \
    .appName("myApp") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
    .config("spark.mongodb.output.uri", "mongodb://127.0.0.1:27017/spark.test_pyspark_mbd_conn") \
    .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.2') \
    .getOrCreate()
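
As a side note (not part of the original answer), the jars directory from step 1 can usually be located programmatically, since the conda/pip install keeps its bundled jars next to the pyspark package itself:

import os
import pyspark

# Resolve the pyspark package location and its bundled jars directory
jars_dir = os.path.join(os.path.dirname(pyspark.__file__), "jars")
print(jars_dir)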

Disclaimers:

1) I don't know if this answer is the right place/SO question to put this; please advise of a better place and I will move it.

2) If you think I have erred or you have improvements to the process above, please comment and I will revise.

ximiki
3

Finally found the answer after multiple tries. The answer is specific to the spark-csv jar. Create a folder on your hard drive, say D:\Spark\spark_jars, and place the following jars there:

  1. spark-csv_2.10-1.4.0.jar (this is the version I am using)
  2. commons-csv-1.1.jar
  3. univocity-parsers-1.5.1.jar

2 and 3 are dependencies required by spark-csv, so those two files need to be downloaded too. Go to the conf directory of your Spark download and add this line to the spark-defaults.conf file:

spark.driver.extraClassPath D:/Spark/spark_jars/*

The asterisk picks up all the jars in the folder. Now run Python and create the SparkContext and SQLContext as you normally would. You should then be able to use spark-csv like this:

sqlContext.read.format('com.databricks.spark.csv') \
    .options(header='true', inferschema='true') \
    .load('foobar.csv')
Indrajit
3

From working with PySpark, PostgreSQL, and Apache Sedona, I learned to solve this with 2 methods.


Method 1: Download the JAR file and add to spark.jars

In order to use PostgreSQL on Spark, I needed to add the JDBC driver (JAR file) to PySpark.

First, I created a jars directory at the same level as my program and stored the postgresql-42.5.0.jar file there.

Then, I simply added this config to the SparkSession builder: SparkSession.builder.config("spark.jars", "{JAR_FILE_PATH}")

spark = (
    SparkSession.builder
    .config("spark.jars", "jars/postgresql-42.5.0.jar")
    .master("local[*]")
    .appName("Example - Add a JAR file")
    .getOrCreate()
)
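
With the driver JAR on the classpath, a JDBC read against PostgreSQL looks roughly like this (a sketch; the connection URL, table, and credentials are hypothetical):

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  # hypothetical database
    .option("dbtable", "my_table")                           # hypothetical table
    .option("user", "postgres")
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
)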

Method 2: Use Maven Central coordinate and spark.jars.packages

If your dependency JAR files are available on Maven, you can use this method and not have to maintain any JAR file.

Steps

  1. Find your package on Maven Central Repository Search (example: postgresql).

  2. Select the correct package artifact and copy its Maven Central coordinate.

  3. In Python, call SparkSession.builder.config("spark.jars.packages", "{MAVEN_CENTRAL_COORDINATE}").

    # Imports assumed by this snippet: SparkSession from PySpark, and the Kryo helpers from the apache-sedona package
    from pyspark.sql import SparkSession
    from sedona.utils import KryoSerializer, SedonaKryoRegistrator

    spark = (
        SparkSession.builder
        .appName('Example - adding many Maven packages')
        .config("spark.serializer", KryoSerializer.getName)
        .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
        .config("spark.jars.packages",
                "org.postgresql:postgresql:42.5.0,"
                + "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating,"
                + "org.datasyslab:geotools-wrapper:1.1.0-25.2")
        .getOrCreate()
    )
    

Pros of using spark.jars.packages

  • You can add several packages
  • You don't have to manage the fat JAR files

Cons of using spark.jars.packages

The .config("spark.jars.packages", ...) call accepts a single string value, so in order to add several packages you need to concatenate the package coordinates using , as the delimiter.

"org.postgresql:postgresql:42.5.0,"
+ "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating,"
+ "org.datasyslab:geotools-wrapper:1.1.0-25.2"

*** The string will not tolerate newlines, spaces, or tabs in your code; they cause nasty bugs whose error logs look unrelated to the real cause.
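
One way to avoid whitespace mistakes is to build the value from a list of coordinates with join. A small sketch based on the coordinates above (not part of the original answer):

packages = ",".join([
    "org.postgresql:postgresql:42.5.0",
    "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating",
    "org.datasyslab:geotools-wrapper:1.1.0-25.2",
])
# The joined string contains no stray whitespace between coordinates
spark = SparkSession.builder.config("spark.jars.packages", packages).getOrCreate()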

Quan Bui
1
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.10.4-src.zip'))

Here it comes....

sys.path.insert(0, <PATH TO YOUR JAR>)

Then...

import pyspark
import numpy as np

from pyspark import SparkContext

sc = SparkContext("local[1]")
.
.
.
0

For the Spark operator in a YAML manifest, you can set "spark.jars.packages" in sparkConf to load several packages:

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: test
  namespace: default
spec:
  type: Python
  pythonVersion: "3"
  mode: cluster
  sparkVersion: "3.3.2"
  sparkConf:
    "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.2,com.amazonaws:aws-java-sdk-bundle:1.12.99"