
I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.

To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath that points to the relevant PostgreSQL JAR already downloaded onto the master and slave nodes, or you can pass these settings as arguments to a spark-submit job.

Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?

I tried the following:

  1. Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR into it (postgresql-9.41207.jre6.jar).

  2. Edited spark-defaults.conf to include the wildcard location:

    spark.driver.extraClassPath  :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
    
  3. Tried to create a DataFrame in a Jupyter cell using the following code:

    SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
    spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
    

I get a Java error as per below:

Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver

Help appreciated.

Tex
  • The driver should be both on driver and `executor` (`spark.executor.extraClassPath`). Also the version of the driver looks funky - it is fairly unlikely you use JRE 6.. – 10465355 Jan 12 '19 at 15:31
  • Does the master/cluster need to be restarted to 'read-in' the spark-defaults.conf file? Or will this be checked dynamically when the JDBC call is made? – Tex Jan 12 '19 at 18:13
  • Application has to be, cluster no. Also note [the point](https://stackoverflow.com/a/54161463/10465355) made by [@afaq](https://stackoverflow.com/users/1733823/afaq) – 10465355 Jan 14 '19 at 10:58
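
(A quick, hypothetical way to confirm this from the notebook - spark-defaults.conf is only read when the application starts, so you can inspect what the running session was actually launched with:)

    # check which classpaths the already-running Spark application was started with
    print(spark.sparkContext.getConf().get("spark.driver.extraClassPath", "not set"))
    print(spark.sparkContext.getConf().get("spark.executor.extraClassPath", "not set"))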

2 Answers


I think you don't need to copy the postgres JAR to the slaves, as the driver program and cluster manager take care of everything. I've created a DataFrame from a Postgres external source in the following way:

Download postgres driver jar:

cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar

Create dataframe:

attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
                 .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'database' : <db>,
             'dbtable' : <select * from table>}
df = spark.read.format('jdbc').options(**attribute).load()
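
For reference, here is a filled-in sketch of the same approach (host, credentials and table names below are placeholders, not values from the post); note that a raw SELECT passed as dbtable has to be wrapped in parentheses and given an alias:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    attribute = {
        # placeholder connection details - substitute your own
        'url': 'jdbc:postgresql://db-host:5432/dbname?user=dbuser&password=dbpass',
        'driver': 'org.postgresql.Driver',               # optional when the JAR is on the classpath
        'dbtable': '(select * from some_table) as tmp',  # a bare table name also works
    }

    df = spark.read.format('jdbc').options(**attribute).load()
    df.printSchema()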

Submit the Spark job: add the downloaded JAR to the driver class path when submitting the Spark job.

--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5 
Md Shihab Uddin
  • Thanks - this worked for me. Next step is to try to hide the JDBC access credentials using environment variables. It seems difficult to find documentation for this, though. – Tex Jan 21 '19 at 12:29

Check the GitHub repo of the driver. The driver class name should be org.postgresql.Driver. Try using that instead.
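
For example, the call from the question with only the driver class corrected (the connection string is the question's placeholder, not a real endpoint):

    SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
    df = spark.read.jdbc(SQL_CONN, table="someTable",
                         properties={"driver": "org.postgresql.Driver"})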

Afaq