Connect Amazon EMR Spark with MySQL (writing data)

Question

I have a potentially stupid question; I actually fixed this issue when running Spark locally but haven't been able to resolve it when running it on AWS EMR.

Basically, I have a pyspark script which I submit that reads in data, manipulates it, processes it into a Spark Dataframe and writes it into a MySQL table that I have already hosted elsewhere on AWS RDS.

This is EMR 5.6, with Spark 2.1.1

I downloaded the latest drivers for MySQL connector ("mysql-connector-java-5.1.42-bin.jar") and put them into my instance with the Master Node (basically downloaded it onto my local laptop and then used scp to put it in the master node).

I then found my spark-defaults.conf file under /etc/spark/conf and edited the following parameters:

spark.driver.extraClassPath
spark.executor.extraClassPath

To both of these, I added the path to my mysql-connector file, which was found at /home/hadoop/mysql-connector-java-5.1.42-bin.jar

Based on this SO post (Adding JDBC driver to Spark on EMR), I use the following command to submit (included the entire path from "extraClassPath"):

spark-submit sample_script.py --driver-class-path /home/hadoop/mysql-connector-java-5.1.42-bin.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*

In my code, I have a spark dataframe and the following code is what writes to the database:

SQL_CONN = "jdbc:mysql://name.address.amazonaws.com:8000/dbname?user=user&password=pwd"
spark_df.write.jdbc(SQL_CONN, table="tablename", mode="append", properties={"driver":'com.mysql.jdbc.Driver'})

The specific error I get is this:

java.lang.ClassNotFoundException (com.mysql.jdbc.Driver) [duplicate 51]

Any input would be appreciated... this feels like a really stupid mistake on my part that I an unable to pinpoint.

`{"driver":'com.mysql.jdbc.Driver'}` shouldn't it be double quotes here? Not really related to the problem. I would check first if the user that are running the `spark-submit` has access to the given classpaths paths, later I would check in the logs (wherever they are) if those classpaths was really loaded — Jorge Campos, Jun 28 '17 at 04:57
I think it's actually bc my slave nodes don't have the jar file. Let me try real quick. — shishy, Jun 28 '17 at 04:57

score 5 · Accepted Answer · answered Jun 28 '17 at 05:29

Fixed - I was stupid and forgot to put the jar file in my slave nodes as well. I forgot that --driver-class-path doesn't automatically distribute the jar to my slaves.

It worked once I put the jar file in the same root directory as it was in my master node (i.e. /home/hadoop in my case).

Hope this helps.

score 4 · Answer 2 · answered Aug 31 '20 at 05:36

Although the answer by author is correct, but instead of putting jar manually, you can use --jars to submit a jar and it will handle rest for you

spark-submit  --jars /home/hadoop/mysql-connector-java-5.1.42-bin.jar sample-script.py

Although not asked explicitly, but in EMR notebook, since you dont want to run spark-submit yourself, there is more easier way

Upload the jar file to s3, Let this be the first cell of the notebook

%%configure -f
{
    "conf": {
        "spark.jars": "s3://jar-test/mysql-connector-java-5.1.42-bin.jar"        
    }
}

Connect Amazon EMR Spark with MySQL (writing data)

2 Answers2