That’s pretty straightforward. To connect to an external database and retrieve data into Spark dataframes, an additional jar file is required.
For example, with MySQL the JDBC driver is required. Download the driver package and extract mysql-connector-java-x.yy.zz-bin.jar to a path that is accessible from every node in the cluster. Preferably this is a path on a shared file system. With a Pouta Virtual Cluster, for instance, such a path would be under /shared_data; here I use /shared_data/thirdparty_jars/.
With direct Spark job submissions from the terminal, one can specify the --driver-class-path argument pointing to extra jars that should be provided to the workers along with the job. However, this does not work with this approach, so these paths must be configured for both the front-end and worker nodes in the spark-defaults.conf file, usually found in the /opt/spark/conf directory.
Add whichever connector jar matches your database server to both class paths:

spark.driver.extraClassPath /shared_data/thirdparty_jars/mysql-connector-java-5.1.35-bin.jar
spark.executor.extraClassPath /shared_data/thirdparty_jars/mysql-connector-java-5.1.35-bin.jar
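
Once the connector is on both class paths, a dataframe can be read straight from the database over JDBC. Below is a minimal PySpark sketch; the host name, database, table, and credentials are placeholders for illustration, not values from this setup.

    from pyspark.sql import SparkSession

    # The session picks up spark.driver.extraClassPath and
    # spark.executor.extraClassPath from spark-defaults.conf,
    # so the MySQL driver is already available to every node.
    spark = SparkSession.builder.appName("mysql-read-example").getOrCreate()

    # All connection details below are hypothetical placeholders.
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://db.example.com:3306/mydb")
          .option("driver", "com.mysql.jdbc.Driver")  # driver class for Connector/J 5.1.x
          .option("dbtable", "mytable")
          .option("user", "myuser")
          .option("password", "mypassword")
          .load())

    df.printSchema()
    df.show(5)

Note that the extraClassPath settings are read when the JVM starts, so restart the Spark processes (or your Spark session) after editing spark-defaults.conf for the change to take effect.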