I have a potentially stupid question: I actually fixed this issue when running Spark locally, but I haven't been able to resolve it when running on AWS EMR.
Basically, I have a PySpark script that I submit; it reads in data, manipulates it, processes it into a Spark DataFrame, and writes it to a MySQL table that I already host elsewhere on AWS RDS.
This is EMR 5.6, with Spark 2.1.1.
I downloaded the latest MySQL connector driver ("mysql-connector-java-5.1.42-bin.jar") and put it on the master node instance (downloaded it onto my local laptop, then used scp to copy it to the master node).
I then found my spark-defaults.conf file under /etc/spark/conf and edited the following parameters:
spark.driver.extraClassPath
spark.executor.extraClassPath
To both of these, I added the path to my mysql-connector jar, which is at /home/hadoop/mysql-connector-java-5.1.42-bin.jar
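For reference, the edited lines in spark-defaults.conf look roughly like this (the pre-existing EMR classpath entries, elided here as a placeholder, were kept in place):

```
spark.driver.extraClassPath    /home/hadoop/mysql-connector-java-5.1.42-bin.jar:<existing EMR entries>
spark.executor.extraClassPath  /home/hadoop/mysql-connector-java-5.1.42-bin.jar:<existing EMR entries>
```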
Based on this SO post (Adding JDBC driver to Spark on EMR), I use the following command to submit (I included the entire path from "extraClassPath"):
spark-submit sample_script.py --driver-class-path /home/hadoop/mysql-connector-java-5.1.42-bin.jar:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*
In my code, I have a Spark DataFrame, and the following code is what writes to the database:
SQL_CONN = "jdbc:mysql://name.address.amazonaws.com:8000/dbname?user=user&password=pwd"
spark_df.write.jdbc(SQL_CONN, table="tablename", mode="append", properties={"driver":'com.mysql.jdbc.Driver'})
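As an aside, I also tried passing the credentials through the properties dict instead of embedding them in the URL; a minimal sketch of that variant, with the host, port, database name, and credentials all being placeholders:

```python
# Sketch: build the JDBC URL and connection properties separately.
# Host, port, dbname, user, and password below are placeholders.
host = "name.address.amazonaws.com"
port = 8000
dbname = "dbname"

jdbc_url = "jdbc:mysql://{}:{}/{}".format(host, port, dbname)

conn_properties = {
    "user": "user",
    "password": "pwd",
    "driver": "com.mysql.jdbc.Driver",
}

# The actual write call (requires an active SparkSession and spark_df):
# spark_df.write.jdbc(jdbc_url, table="tablename", mode="append",
#                     properties=conn_properties)
```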
The specific error I get is this:
java.lang.ClassNotFoundException (com.mysql.jdbc.Driver) [duplicate 51]
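For what it's worth, I also sanity-checked that the class is actually inside the jar. Since a .jar is just a zip archive, Python's zipfile module can list its entries; jar_contains_class is a throwaway helper I sketched for this, not part of any library:

```python
# Sanity check: confirm the connector jar really contains the class
# com.mysql.jdbc.Driver.  A .jar file is a zip archive, so the standard
# zipfile module can inspect it directly.
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if class_name (dotted form) has a .class entry in the jar."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# e.g.:
# jar_contains_class("/home/hadoop/mysql-connector-java-5.1.42-bin.jar",
#                    "com.mysql.jdbc.Driver")
```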
Any input would be appreciated... this feels like a really stupid mistake on my part that I'm unable to pinpoint.