
So, I have a PySpark program that runs fine with the following command:

spark-submit --jars terajdbc4.jar,tdgssconfig.jar --master local sparkyness.py

And yes, it's running in local mode and just executing on the master node.

I want to be able to launch my PySpark script though with just:

python sparkyness.py

So, I have added the following lines of code throughout my PySpark script to facilitate that:

import findspark
findspark.init()



sconf.setMaster("local")



sc._jsc.addJar('/absolute/path/to/tdgssconfig.jar')
sc._jsc.addJar('/absolute/path/to/terajdbc4.jar')
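Pieced together, the relevant bits of the script look roughly like this (the SparkConf/SparkContext glue shown is just the usual pattern, exact code trimmed; the app name is a placeholder):

import findspark
findspark.init()  # puts the local Spark installation on sys.path so plain python works

from pyspark import SparkConf, SparkContext

sconf = SparkConf().setAppName('sparkyness')  # placeholder app name
sconf.setMaster("local")
sc = SparkContext(conf=sconf)

# attempt to make the Teradata driver jars visible to the JVM
sc._jsc.addJar('/absolute/path/to/tdgssconfig.jar')
sc._jsc.addJar('/absolute/path/to/terajdbc4.jar')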

This does not seem to be working though. Every time I try to run the script with python sparkyness.py I get the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o48.jdbc.
: java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

What is the difference between spark-submit --jars and sc._jsc.addJar('myjar.jar') and what could be causing this issue? Do I need to do more than just sc._jsc.addJar()?

timbram

1 Answer


Use spark.jars when building the SparkSession

spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars', '/absolute/path/to/jar')\
    .getOrCreate()
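For the Teradata jars from the question, that would look something like this (the JDBC URL, table, and credentials below are placeholders, not taken from your setup):

from pyspark.sql import SparkSession

# pass both driver jars as one comma-separated string, same format as --jars
spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars', '/absolute/path/to/terajdbc4.jar,/absolute/path/to/tdgssconfig.jar')\
    .getOrCreate()

# placeholder JDBC read, just to show the driver being picked up
df = spark.read.format('jdbc') \
    .option('url', 'jdbc:teradata://<host>/DATABASE=<db>') \
    .option('driver', 'com.teradata.jdbc.TeraDriver') \
    .option('dbtable', '<table>') \
    .option('user', '<user>') \
    .option('password', '<password>') \
    .load()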

Related: Add Jar to standalone pyspark

Edit: I don't recommend hijacking _jsc, because I don't think it handles distributing the jars to the driver and executors or adding them to the classpath.

Example: I created a new SparkSession without the Hadoop AWS jar, then tried to access S3. Here's the error (the same error as when adding the jar with sc._jsc.addJar):

Py4JJavaError: An error occurred while calling o35.parquet. : java.io.IOException: No FileSystem for scheme: s3

Then I created a session with the jar and got a new, expected error:

Py4JJavaError: An error occurred while calling o390.parquet. : java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).

Garren S
  • Nice! I am going to try this out! – timbram Feb 02 '18 at 16:17
  • General question. How did you find out about this option `spark.jars`? – timbram Feb 02 '18 at 17:28
  • Spark docs on configuration: https://spark.apache.org/docs/latest/configuration.html - note spark.jars and one of the simplest/most powerful options, "spark.jars.packages", which allows you to use Maven coordinates to automatically add the dependencies, a much better way than compiling a fat jar with dependencies (quick sketch of both routes after these comments) – Garren S Feb 02 '18 at 17:48
  • Thanks, bookmarked that link! And yes, using it that way works awesomely. I wasn't using the `SparkSession.builder` way so I had to use the `.set()` method of `pyspark.SparkConf` – timbram Feb 02 '18 at 18:04
  • Out of scope for this question, but I would strongly recommend using SparkSession and Spark 2.x – Garren S Feb 02 '18 at 18:17
  • That's fine, I love out of scope! :) Any specific reason why you recommend it? Or just more because it's the modern way? – timbram Feb 02 '18 at 18:20
  • Spark 2.x has many new features, mainly making DataFrames (PySpark) and Datasets (Scala/Java) the preferred data structures. SparkContext was subsumed by SparkSession. For PySpark it makes the most sense to use only Spark 2.x, because the performance of JVM-backed SQL APIs like the DataFrame is now in line with Scala, unlike RDDs, which are unoptimized, low-level structures – Garren S Feb 02 '18 at 22:25
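To make the two suggestions from the comments concrete (the app name, Maven coordinate, and jar paths below are illustrative placeholders): spark.jars.packages takes a Maven coordinate and resolves the jar plus its transitive dependencies automatically, for example:

from pyspark.sql import SparkSession

# group:artifact:version; hadoop-aws is just an example coordinate
spark = SparkSession.builder.appName('my_awesome')\
    .config('spark.jars.packages', 'org.apache.hadoop:hadoop-aws:2.7.3')\
    .getOrCreate()

And if you build a SparkContext directly instead of using the builder, the same spark.jars setting can go through SparkConf.set():

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('my_awesome').setMaster('local')\
    .set('spark.jars', '/absolute/path/to/terajdbc4.jar,/absolute/path/to/tdgssconfig.jar')
sc = SparkContext(conf=conf)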