
I am trying to launch an EMR cluster that includes the Postgres driver JAR file so I can load data from Postgres and analyze it with PySpark. The JAR I would like to include is stored in S3. I have tried the following things:

1 - Entering the following configuration:

[
  {
    "Classification": "presto-connector-postgresql",
    "Properties": {
      "connection-url": "jdbc:postgresql://example.net:5432/database",
      "connection-user": "MYUSER",
      "connection-password": "MYPASS"
    },
    "Configurations": []
  }
]

2 - Adding the JAR as a custom step (selecting the JAR from S3)

3 - Adding the JAR as a custom bootstrap action (selecting the JAR from S3)

None of these works: I can't figure out how to use the connector from option 1 within Jupyter, and both the custom step and the custom bootstrap action fail when I launch the cluster. How can I launch an EMR cluster with the Postgres driver installed so that I can query my data from Jupyter?
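
Regarding option 1: as far as I can tell, the presto-connector-postgresql classification only configures a Postgres catalog for Presto, so it is never visible to Spark/PySpark. If I understand correctly, it would only be reachable from Presto itself, something like this (catalog name assumed to be postgresql, schema assumed to be public):

# On the EMR master node; catalog/schema names are assumptions
presto-cli --catalog postgresql --schema public \
    --execute "SELECT * FROM slm_files LIMIT 10"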

EDIT:

I used the following bootstrap script to copy the JAR to my master/worker nodes:

#!/bin/bash
# Copy the Postgres JDBC driver from S3 onto each node at bootstrap time
aws s3 cp s3://BUCKETNAME/postgresql-42.2.8.jar /mnt1/myfolder

But I still get the following error:

An error was encountered:
An error occurred while calling o90.load.
: java.lang.ClassNotFoundException: org.postgresql.Driver

With the following code (URL and credentials redacted to placeholders):

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbcURL") \
    .option("user", "user") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .option("query", "select * from slm_files limit 100") \
    .load()

df.count()
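
My guess is that copying the JAR onto the nodes is not enough on its own, because Spark only loads drivers that are explicitly put on its classpath. A sketch of what I mean, pointing the standard spark.driver.extraClassPath and spark.executor.extraClassPath properties at the bootstrap copy above (assumes /mnt1/myfolder is a directory; I have not confirmed this variant works):

%%configure -f
{
  "conf": {
    "spark.driver.extraClassPath": "/mnt1/myfolder/postgresql-42.2.8.jar",
    "spark.executor.extraClassPath": "/mnt1/myfolder/postgresql-42.2.8.jar"
  }
}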

1 Answer


Using this code in the first cell of my Jupyter notebook solved it for me:

%%configure -f
{
  "conf": {
    "spark.jars": "s3://JAR-LOCATION/postgresql-42.2.8.jar"
  }
}
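
For what it's worth, %%configure -f is a Sparkmagic command: it restarts the Livy session with the given Spark properties, and the JARs listed in spark.jars are pulled from S3 onto the driver and executor classpaths when the new session starts. Note that spark.jars takes a comma-separated string, not a JSON array, so adding a second driver (the Redshift JAR name below is only an illustration) would look like:

%%configure -f
{
  "conf": {
    "spark.jars": "s3://JAR-LOCATION/postgresql-42.2.8.jar,s3://JAR-LOCATION/redshift-jdbc42-2.1.0.9.jar"
  }
}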

  • Thanks, man. I was stuck on this for a long time; it was a saviour. – Rajnish kumar Jun 24 '21 at 11:44
  • This worked for Postgres. Do you have any idea how to do it for Redshift? I am getting the same error for Redshift. – Rajnish kumar Jun 24 '21 at 12:31
  • Try adding the Redshift JAR to that location in S3 and making the `spark.jars` value a list that contains both. This does work for me with Redshift too, because it uses Postgres syntax. – DBA108642 Jun 24 '21 at 13:41
  • I wasn't able to add both to `spark.jars` in list format; it accepts only one at a time. But it worked another way: I copied the JAR directly onto the EMR cluster and it is picked up from there. – Rajnish kumar Jun 24 '21 at 14:17