4

I'm trying to add a JDBC driver to a Spark cluster that is running on top of Amazon EMR, but I keep getting the following exception:

java.sql.SQLException: No suitable driver found for

I tried the following things:

  1. Using addJar to add the driver JAR explicitly from the code.
  2. Using the spark.executor.extraClassPath and spark.driver.extraClassPath parameters.
  3. Using spark.driver.userClassPathFirst=true. With this option I get a different error because of a mix of dependencies with Spark; in any case, this option seems too aggressive if all I want is to add a single JAR.

Could you please help me with this? How can I introduce the driver to the Spark cluster easily?

Thanks,

David

Source code of the application

val properties = new Properties()
properties.put("ssl", "***")
properties.put("user", "***")
properties.put("password", "***")
properties.put("account", "***")
properties.put("db", "***")
properties.put("schema", "***")
properties.put("driver", "***")

val conf = new SparkConf().setAppName("***")
      .setMaster("yarn-cluster")
      .setJars(JavaSparkContext.jarOfClass(this.getClass()))

val sc = new SparkContext(conf)
sc.addJar(args(0))
val sqlContext = new SQLContext(sc)

var df = sqlContext.read.jdbc(connectStr, "***", properties = properties)
df = df.select( Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***,
                Constants.***)
// Additional actions on df
4 Answers

2

I had the same problem. What ended up working for me is to use the --driver-class-path parameter with spark-submit.

The main thing is to add the entire existing Spark classpath to --driver-class-path.

Here are my steps:

  1. I got the default driver classpath by reading the value of the spark.driver.extraClassPath property from the Spark History Server under "Environment".
  2. Copied the MySQL JAR file to each node in the EMR cluster.
  3. Put the MySQL JAR path at the front of the --driver-class-path argument to the spark-submit command and appended the value of spark.driver.extraClassPath to it.

My driver class path ended up looking like this:

--driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:/etc/hadoop/conf:/usr/lib/hadoop/*:/usr/lib/hadoop-hdfs/*:/usr/lib/hadoop-mapreduce/*:/usr/lib/hadoop-yarn/*:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*

This worked with EMR 4.1 using Java with Spark 1.5.0. I had already added the MySQL JAR as a dependency in the Maven pom.xml.
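Putting it together, the spark-submit command for this approach would look roughly like the following; the main class, application JAR, and S3 path are placeholders, and the classpath is the one assembled in step 3 above:

spark-submit --deploy-mode cluster \
    --class com.example.MySparkJob \
    --driver-class-path /home/hadoop/jars/mysql-connector-java-5.1.35.jar:<value of spark.driver.extraClassPath> \
    s3://my-bucket/my-spark-job.jar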

You may also want to look at this answer as it seems like a cleaner solution. I haven't tried it myself.

1

With EMR 5.2 I add any new jars to the original driver classpath with:

export MY_DRIVER_CLASS_PATH=my_jdbc_jar.jar:some_other_jar.jar:$(grep spark.driver.extraClassPath /etc/spark/conf/spark-defaults.conf | awk '{print $2}')

and after that

spark-submit --driver-class-path $MY_DRIVER_CLASS_PATH
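The rest of the spark-submit arguments (deploy mode, main class, application JAR) follow as usual; for example, with hypothetical placeholder names:

spark-submit --deploy-mode cluster \
    --driver-class-path "$MY_DRIVER_CLASS_PATH" \
    --class com.example.MySparkJob \
    s3://my-bucket/my-spark-job.jar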
1

Following a similar pattern to the answer quoted above, this is how I automated installing a JDBC driver on EMR clusters. (Full automation is useful for transient clusters that are started and terminated per job.)

  • use a bootstrap action to install the JDBC driver on all EMR cluster nodes. Your bootstrap action will be a one-line shell script, stored in S3, that looks like
aws s3 cp s3://.../your-jdbc-driver.jar /home/hadoop
  • add a step to your EMR cluster before running your actual Spark job, to modify /etc/spark/conf/spark-defaults.conf

This will be another one-line shell script, stored in S3:

sudo sed -e 's,\(^spark.driver.extraClassPath.*$\),\1:/home/hadoop/your-jdbc-driver.jar,' -i /etc/spark/conf/spark-defaults.conf

The step itself will look like

{
    "name": "add JDBC driver to classpath",
    "jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
    "args": ["s3://...bucket.../set-spark-driver-classpath.sh"]
}

This will add your JDBC driver to spark.driver.extraClassPath

Explanation

  • you can't do both as bootstrap actions because Spark won't be installed yet, so no config file to update

  • you can't install the JDBC driver as a step, because you need the JDBC driver installed on the same path on all cluster nodes. In YARN cluster mode, the driver process does not necessarily run on the master node.

  • The configuration only needs to be updated on the master node, though, as the config is packed up and shipped to whatever node ends up running the driver.
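For a fully scripted setup, both pieces can be attached when the cluster is created from the AWS CLI; a rough sketch follows, where the bucket names, script names, release label, and instance settings are placeholders and the step mirrors the JSON above:

aws emr create-cluster \
    --name "transient-spark-job" \
    --release-label emr-5.20.0 \
    --applications Name=Spark \
    --use-default-roles \
    --instance-type m4.large \
    --instance-count 3 \
    --bootstrap-actions Path=s3://my-bucket/install-jdbc-driver.sh \
    --steps Type=CUSTOM_JAR,Name=AddJdbcDriverToClasspath,Jar=s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar,Args=[s3://my-bucket/set-spark-driver-classpath.sh]

The Spark job itself would then be added as a later step, so it only runs after the classpath has been updated.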

  • where do you get that `script-runner.jar`, is using add step on emr cluster similar to the step you configured there ? – billie class Jul 20 '22 at 15:30
0

In case you're using Python in your EMR cluster, there's no need to specify the JAR while creating the cluster. You can add the JAR package while creating your SparkSession.

    from pyspark.sql import SparkSession

    # A second .config() call with the same key overwrites the first, so both
    # Maven coordinates go into one comma-separated spark.jars.packages value.
    spark = SparkSession \
        .builder \
        .config("spark.jars.packages",
                "org.apache.hadoop:hadoop-aws:2.7.0,mysql:mysql-connector-java:8.0.17") \
        .getOrCreate()

And then, when you make your query, specify the driver like this:

 form_df = spark.read.format("jdbc"). \
 option("url", "jdbc:mysql://yourdatabase"). \
 option("driver", "com.mysql.jdbc.Driver"). \

This way the package is included in the SparkSession, as it is pulled from a Maven repository. I hope this helps someone who is in the same situation I once was.
