
I am trying to connect to a database with pyspark and I am using the following code:

sqlctx = SQLContext(sc)
df = sqlctx.load(
    url = "jdbc:postgresql://[hostname]/[database]",
    dbtable = "(SELECT * FROM talent LIMIT 1000) as blah",
    password = "MichaelJordan",
    user =  "ScottyPippen",
    source = "jdbc",
    driver = "org.postgresql.Driver"
)

and I am getting the following error:

[screenshot of the error]

Any idea why this is happening?

Edit: I am trying to run the code locally on my computer.

Mpizos Dimitris

12 Answers


Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download/

Then replace the database configuration values with yours.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Rafael
  • This one works! I was running my code in JupyterLab. I tried stopping and restarting the `spark` session, but it didn't load. When I restarted the Jupyter kernel, it worked! – Thamme Gowda Jul 13 '20 at 03:32
  • It doesn't work for me, it still gives "java.lang.ClassNotFoundException: org.postgresql.Driver". Any idea? – SudipM Nov 13 '20 at 12:18
  • For me it gives the following error: `: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.` My Postgres DB has a name, and I don't see an option to include that in the syntax. However, when I use the `psycopg2` library it works. Is there any way I can make Spark connect directly? – thentangler Sep 28 '21 at 13:52
  • Can the url, username and password be added to the config of the SparkSession so that Spark can let me just do `spark.sql('select * from db.tblname limit 10')`? – sAguinaga Sep 09 '22 at 17:36
  • The url changed to https://jdbc.postgresql.org/download/, could you please update it? – exyi Sep 28 '22 at 16:35

The following worked for me with postgres on localhost:

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.

For the pyspark shell you use the SPARK_CLASSPATH environment variable:

$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark

For submitting a script via spark-submit use the --driver-class-path flag:

$ spark-submit --driver-class-path /path/to/downloaded/jar script.py

In the Python script, load the table as a DataFrame as follows:

from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)

or alternatively:

df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename').\
    load()

Note that when submitting the script via spark-submit, you need to define the sqlContext.
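
For example, a minimal sketch of defining it yourself at the top of the script (the app name, database name, and credentials are placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Create the contexts explicitly; unlike the pyspark shell, spark-submit does not do this for you
sc = SparkContext(appName="postgres-example")
sqlContext = SQLContext(sc)

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = sqlContext.read.jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)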

mdh

It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars.

Also, I set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),

and it works fine in the pyspark console and Jupyter.
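
For example, a minimal sketch of those two steps (the paths match the ones above and will differ on your system):

cp postgresql-42.1.4.jar /opt/spark-2.2.0-bin-hadoop2.7/jars/
echo 'export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"' >> ~/.bashrc
source ~/.bashrc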

Chevelle
  • I'm looking for how to do this in Amazon EMR. It used to work for me just setting spark.jar=/path/to/driver, but after EMR 5.2.1 (which was Spark version 2.0.2) I can't get it to work anymore. – Evan Zamir Mar 13 '18 at 18:18

To use pyspark with a Jupyter notebook: first open pyspark with

pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar  --jars /spark_drivers/postgresql-42.2.12.jar

Then in the Jupyter notebook:

import os
from pyspark.sql import SparkSession

# Expand '~' here: the JVM does not expand it in spark.driver.extraClassPath
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")

spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Galuoises
  • FWIW, I was not able to get any of the other solutions to work without specifying --driver-class-path and --jars directly to pyspark. This is the correct answer. – Aki Nov 06 '20 at 15:01
  • I could not get it to work in a Zeppelin notebook on AWS, but it worked when I ssh'ed into the master EC2 node and used pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar. Thanks – Robin Jun 15 '21 at 19:09

You normally need either:

  1. to install the Postgres driver on your cluster,
  2. to provide the Postgres driver jar from your client with the --jars option,
  3. or to provide the Maven coordinates of the Postgres driver with the --packages option.

If you detail how you are launching pyspark, we may be able to give you more details.
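
For example, options 2 and 3 would look roughly like this when launching the pyspark shell (the jar path and driver version are placeholders):

pyspark --jars /path/to/postgresql-42.2.5.jar
pyspark --packages org.postgresql:postgresql:42.2.5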

Some clues/ideas:

spark-cannot-find-the-postgres-jdbc-driver

Not able to connect to postgres using jdbc in pyspark shell

MiguelPeralvo
  • What if I am trying to run it locally? Do I need to download the postgres driver? And where should I store it? – Mpizos Dimitris Jan 22 '16 at 15:31
  • Yes, you need it. Because you're acting as a Postgres client, and you're specifying that you're going to use the Postgres driver in the `driver = "org.postgresql.Driver"` option. You can store it wherever you want on your local computer (for example jre\lib\ext of your Java installation) and specify the storage path in the CLASSPATH. – MiguelPeralvo Jan 22 '16 at 15:42

One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.

This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:

/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit\
 --packages org.postgresql:postgresql:9.4.1211\
 --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar\
 --master local[4] main.py

And in main.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataframe = spark.read.format('jdbc').options(
        url = "jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
        database='my_db',
        dbtable='my_table'
    ).load()

dataframe.show()
Derek Hill

I had trouble getting a connection to the Postgres DB with the jars I had on my computer. This code solved my problem with the driver:

from pyspark.sql import SparkSession
import os

# PYSPARK_SUBMIT_ARGS must be set before the SparkSession is created so that
# Spark fetches the Postgres driver via --packages
sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", sparkClassPath) \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "yourtablename") \
    .option("user", "postgres") \
    .option("password", "***") \
    .load()

df.show()
Florian Uhlmann
  • Many many thanks! I was using the config approach too and it took a good while to figure out that `.option("driver", "org.postgresql.Driver")` is mandatory for that approach. – RndmSymbl Oct 02 '21 at 20:05

I also got this error

java.sql.SQLException: No suitable driver
 at java.sql.DriverManager.getDriver(Unknown Source)

and adding one item, .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar'), to the SparkSession builder worked.

e.g.:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Python Spark Postgresql') \
    .config("spark.jars", "./postgresql-42.2.18.jar") \
    .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
    .getOrCreate()


df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/abc") \
    .option("dbtable", 'tablename') \
    .option("user", "postgres") \
    .option("password", "1") \
    .load()

df.printSchema()
Finch Xu
  • I tried using .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') and it still did not work. – Robin Jun 15 '21 at 18:02

This exception means the JDBC driver is not on the driver classpath. You can pass the JDBC jar to spark-submit with the --jars parameter, and also add it to the driver classpath using spark.driver.extraClassPath.
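
For example, a minimal sketch of such a spark-submit invocation (the jar path and script name are placeholders):

spark-submit \
  --jars /path/to/postgresql-42.2.5.jar \
  --conf spark.driver.extraClassPath=/path/to/postgresql-42.2.5.jar \
  your_script.py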

Shawn Guo
  1. Download the PostgreSQL JDBC jar (e.g. from https://jdbc.postgresql.org/download/).
  2. Add it to your Spark installation's jars/ folder.
  3. Restart your kernel. It should work.
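
For example, assuming SPARK_HOME points at your Spark installation (the jar version and download path are placeholders):

cp ~/Downloads/postgresql-42.2.5.jar "$SPARK_HOME/jars/"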
Vishal R

Just initialize pyspark with --jars <path/to/your/jdbc.jar>

E.g.: pyspark --jars /path/Downloads/postgresql-42.2.16.jar

Then create a DataFrame as suggested in the other answers above.

E.g.:

df2 = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/db") \
    .option("dbtable", "yourTableHere") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .option("driver", "org.postgresql.Driver") \
    .load()
musiceni

Download postgres JDBC driver from https://jdbc.postgresql.org/download.html and use the script below. Changes to make:

  1. Edit PATH_TO_JAR_FILE
  2. Save your DB credentials in an environment file and load them
  3. Query the DB using the query option and set the fetch size with the fetchsize option

import os
from pyspark.sql import SparkSession

PATH_TO_JAR_FILE = "/home/user/Downloads/postgresql-42.3.3.jar"
spark = SparkSession \
    .builder \
    .appName("Example") \
    .config("spark.jars", PATH_TO_JAR_FILE) \
    .getOrCreate()

DB_HOST = os.environ.get("PG_HOST")
DB_PORT = os.environ.get("PG_PORT")
DB_NAME = os.environ.get("PG_DB_CLEAN")
DB_PASSWORD = os.environ.get("PG_PASSWORD")
DB_USER = os.environ.get("PG_USERNAME")

df = spark.read \
    .format("jdbc") \
    .option("url", f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}") \
    .option("user", DB_USER) \
    .option("password", DB_PASSWORD) \
    .option("driver", "org.postgresql.Driver") \
    .option("query","select * from your_table") \
    .option('fetchsize',"1000") \
    .load()

df.printSchema()