1

Please read carrefully this is not a duplication of this.

I am trying to access an RDS database via EMR on AWS. I did this on Zeppelin:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/home/hadoop/postgresql-42.2.18.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://host:5432/base") \
    .option("dbtable", "tab") \
    .option("user", "xx") \
    .option("password", "xx") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

When I executed I got this error :

java.lang.ClassNotFoundException: org.postgresql.Driver 

So I found this and it worked fine as I can see my table's schema (I added the spark.driver.extraClassPath variable in my interpreter configurations):

root
 |-- domaine: string (nullable = true)
 |-- traitement: string (nullable = true)
 |-- parquet: string (nullable = true)
 |-- status: string (nullable = true)
 |-- date: date (nullable = true) 

BUT, when I try to do df.show() to see the content of the table it returns the same error from before:

Py4JJavaError: An error occurred while calling o118.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, host, executor 1): java.lang.ClassNotFoundException: org.postgresql.Driver
    at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:124)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)

I dont understand why am I able to see the schema but not the content.

Any help? Thank you.

Haha
  • 973
  • 16
  • 43

2 Answers2

0

When you submit a hadoop job application master can get created on any of your worker node including master node (depending on your configuration). If you are using EMR, your application master by default gets created on any of your worker node (CORE node) but not on master.

Your file exists on /home/hadoop/postgresql-42.2.18.jar I'm assuming that means on master node.
Your program will search for this file on that node where the application master is and its definitely not on your master node because for that you wouldn’t get any error.


To solve this issue:

  1. Put the jar in every core node (Not a scalable solution)
  2. Use hdfs://. Putting it in HDFS it much better option. Here HDFS is shared with all CORE & TASK instance.
  3. Same as HDFS instead S3 and fetch it via EMRFS (s3://)

PS: I don't know how you are able to see schema with spark.driver.extraClassPath

Snigdhajyoti
  • 1,327
  • 10
  • 26
0

I resolved this problem by adding the following parameters in the Zeppelin interpreter menu:

spark.driver.extraClassPath=/home/hadoop/postgresql-42.2.18.jar 
spark.jars.packages=org.postgresql:postgresql:42.2.18
Haha
  • 973
  • 16
  • 43