4

I'm trying to add a JDBC driver to a Spark cluster that is running on top of Amazon EMR, but I keep getting the following exception:

java.sql.SQLException: No suitable driver found for

I tried the following things:

  1. Using addJar to add the driver JAR explicitly from the code.
  2. Using the spark.executor.extraClassPath and spark.driver.extraClassPath parameters.
  3. Using spark.driver.userClassPathFirst=true. With this option I get a different error because of a mix of dependencies with Spark; in any case, this option seems too aggressive if all I want is to add a single JAR.

Could you please help me with this? How can I introduce the driver to the Spark cluster easily?

Thanks,

David

Source code of the application

val properties = new Properties()
properties.put("ssl", "***")
properties.put("user", "***")
properties.put("password", "***")
properties.put("account", "***")
properties.put("db", "***")
properties.put("schema", "***")
properties.put("driver", "***")

val conf = new SparkConf().setAppName("***")
      .setMaster("yarn-cluster")
      .setJars(JavaSparkContext.jarOfClass(this.getClass()))

val sc = new SparkContext(conf)
sc.addJar(args(0))
val sqlContext = new SQLContext(sc)

var df = sqlContext.read.jdbc(connectStr, "***", properties = properties)
df = df.select( Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***)
// Additional actions on df
4 Answers

2

I had the same problem. What ended up working for me is to use the --driver-class-path parameter with spark-submit.

The main thing is to add the entire existing Spark classpath to --driver-class-path.

Here are my steps:

  1. I got the default driver classpath by reading the value of the spark.driver.extraClassPath property from the Spark History Server under "Environment".
  2. Copied the MySQL JAR file to each node in the EMR cluster.
  3. Put the MySQL JAR path at the front of the --driver-class-path argument to the spark-submit command and appended the value of spark.driver.extraClassPath to it.

My driver class path ended up looking like this:

--driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*

This worked with EMR 4.1 using Java with Spark 1.5.0. I had already added the MySQL JAR as a dependency in the Maven pom.xml.
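Putting it together, the spark-submit command for this approach would look roughly like the following; the main class, application JAR, and S3 path are placeholders, and the classpath is the one assembled in step 3 above:

spark-submit --deploy-mode cluster \
    --class com.example.MySparkJob \
    --driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:<value of spark.driver.extraClassPath> \
    s3://my-bucket/my-spark-job.jar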

You may also want to look at this answer as it seems like a cleaner solution. I haven't tried it myself.

1

With EMR 5.2 I add any new jars to the original driver classpath with:

export MY_DRIVER_CLASS_PATH=my_jdbc_jar.jar:some_other_jar.jar:$(grep spark.driver.extraClassPath /etc/spark/conf/spark-defaults.conf | awk '{print $2}')

and after that

spark-submit --driver-class-path $MY_DRIVER_CLASS_PATH
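The rest of the spark-submit arguments (deploy mode, main class, application JAR) follow as usual; for example, with hypothetical placeholder names:

spark-submit --deploy-mode cluster \
    --driver-class-path "$MY_DRIVER_CLASS_PATH" \
    --class com.example.MySparkJob \
    s3://my-bucket/my-spark-job.jar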
1

Following a similar pattern to the answer quoted above, this is how I automated installing a JDBC driver on EMR clusters. (Full automation is useful for transient clusters that are started and terminated per job.)

  • use a bootstrap action to install the JDBC driver on all EMR cluster nodes. Your bootstrap action will be a one-line shell script, stored in S3, that looks like
aws s3 cp s3://.../your-jdbc-driver.jar /home/hadoop
  • add a step to your EMR cluster before running your actual Spark job, to modify /etc/spark/conf/spark-defaults.conf

This will be another one-line shell script, stored in S3:

sudo sed -e 's,\(^spark.driver.extraClassPath.*$\),\1:/home/hadoop/your-jdbc-driver.jar,' -i /etc/spark/conf/spark-defaults.conf

The step itself will look like

{
    "name": "add JDBC driver to classpath",
    "jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
    "args": ["s3://...bucket.../set-spark-driver-classpath.sh"]
}

This will add your JDBC driver to spark.driver.extraClassPath

Explanation

  • you can't do both as bootstrap actions because Spark won't be installed yet, so no config file to update

  • you can't install the JDBC driver as a step, because you need the JDBC driver installed on the same path on all cluster nodes. In YARN cluster mode, the driver process does not necessarily run on the master node.

  • The configuration only needs to be updated on the master node, though, as the config is packed up and shipped to whatever node ends up running the driver.
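For a fully scripted setup, both pieces can be attached when the cluster is created from the AWS CLI; a rough sketch follows, where the bucket names, script names, release label, and instance settings are placeholders and the step mirrors the JSON above:

aws emr create-cluster \
    --name "transient-spark-job" \
    --release-label emr-5.20.0 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m4.large \
    --instance-count 3 \
    --bootstrap-actions Path=s3://my-bucket/install-jdbc-driver.sh \
    --steps Type=CUSTOM_JAR,Name=AddJdbcDriverToClasspath,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://my-bucket/set-spark-driver-classpath.sh]

The Spark job itself would then be added as a later step, so it only runs after the classpath has been updated.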

  • where do you get that `script-runner.jar`, is using add step on emr cluster similar to the step you configured there ? – billie class Jul 20 '22 at 15:30
0

In case you're using Python in your EMR cluster, there's no need to specify the JAR while creating the cluster. You can add the JAR package while creating your SparkSession.

    from pyspark.sql import SparkSession

    # A second .config() call with the same key overwrites the first, so both
    # Maven coordinates go into one comma-separated spark.jars.packages value.
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages",
                "org.apache.hadoop:hadoop-aws:2.7.0,mysql:mysql-connector-java:8.0.17") \
        .getOrCreate()

And then, when you make your query, specify the driver like this:

 form_df = spark.read.format("jdbc"). \
 option("url", "jdbc:mysql://yourdatabase"). \
 option("driver", "com.mysql.jdbc.Driver"). \

This way the package is included in the SparkSession, as it is pulled from a Maven repository. I hope this helps someone who is in the same situation I once was.
