
I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL database source.

To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath that points to the relevant PostgreSQL JAR already downloaded onto the master and slave nodes, or you can pass these settings as arguments to a spark-submit job.

Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?

I tried the following:

  1. Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL JAR into it (postgresql-9.41207.jre6.jar).

  2. Edited spark-defaults.conf to include the wildcard location:

    spark.driver.extraClassPath  :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
    
  3. Tried to create a DataFrame in a Jupyter cell using the following code:

    SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
    spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
    

I get a Java error as per below:

Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver

Help appreciated.

Tex
  • The driver should be both on driver and `executor` (`spark.executor.extraClassPath`). Also the version of the driver looks funky - it is fairly unlikely you use JRE 6.. – 10465355 Jan 12 '19 at 15:31
  • Does the master/cluster need to be restarted to 'read-in' the spark-defaults.conf file? Or will this be checked dynamically when the JDBC call is made? – Tex Jan 12 '19 at 18:13
  • Application has to be, cluster no. Also note [the point](https://stackoverflow.com/a/54161463/10465355) made by [@afaq](https://stackoverflow.com/users/1733823/afaq) – 10465355 Jan 14 '19 at 10:58
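
(A quick, hypothetical way to confirm this from the notebook - spark-defaults.conf is only read when the application starts, so you can inspect what the running session was actually launched with:)

    # check which classpaths the already-running Spark application was started with
    print(spark.sparkContext.getConf().get("spark.driver.extraClassPath", "not set"))
    print(spark.sparkContext.getConf().get("spark.executor.extraClassPath", "not set"))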

2 Answers


I think you don't need to copy the postgres JAR to the slaves, as the driver program and cluster manager take care of everything. I've created a DataFrame from a Postgres external source in the following way:

Download postgres driver jar:

cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar

Create dataframe:

attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
                 .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'database' : <db>,
             'dbtable' : <select * from table>}
df = spark.read.format('jdbc').options(**attribute).load()
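
For reference, here is a filled-in sketch of the same approach (host, credentials and table names below are placeholders, not values from the post); note that a raw SELECT passed as dbtable has to be wrapped in parentheses and given an alias:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    attribute = {
        # placeholder connection details - substitute your own
        'url': 'jdbc:postgresql://db-host:5432/dbname?user=dbuser&password=dbpass',
        'driver': 'org.postgresql.Driver',               # optional when the JAR is on the classpath
        'dbtable': '(select * from some_table) as tmp',  # a bare table name also works
    }

    df = spark.read.format('jdbc').options(**attribute).load()
    df.printSchema()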

Submit the Spark job: add the downloaded JAR to the driver class path when submitting the Spark job.

--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5 
Md Shihab Uddin
  • Thanks - this worked for me. Next step is to try to hide the JDBC access credentials using environment variables. It seems difficult to find documentation for this, though. – Tex Jan 21 '19 at 12:29

Check the GitHub repo of the driver. The driver class name should be org.postgresql.Driver. Try using that instead.
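
For example, the call from the question with only the driver class corrected (the connection string is the question's placeholder, not a real endpoint):

    SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
    df = spark.read.jdbc(SQL_CONN, table="someTable",
                         properties={"driver": "org.postgresql.Driver"})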

Afaq