
I am trying to connect to a database with pyspark and I am using the following code:

sqlctx = SQLContext(sc)
df = sqlctx.load(
    url = "jdbc:postgresql://[hostname]/[database]",
    dbtable = "(SELECT * FROM talent LIMIT 1000) as blah",
    password = "MichaelJordan",
    user =  "ScottyPippen",
    source = "jdbc",
    driver = "org.postgresql.Driver"
)

and I am getting the following error:

[screenshot of the error]

Any idea why this is happening?

Edit: I am trying to run the code locally on my computer.

Mpizos Dimitris

12 Answers


Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download/

Then replace the database configuration values with yours.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.jars", "/path_to_postgresDriver/postgresql-42.2.5.jar") \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/databasename") \
    .option("dbtable", "tablename") \
    .option("user", "username") \
    .option("password", "password") \
    .option("driver", "org.postgresql.Driver") \
    .load()

df.printSchema()

More info: https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

Rafael
  • This one works! I was running my code in JupyterLab. I tried stopping and restarting the `spark` session, but it didn't load. When I restarted the Jupyter kernel, it worked! – Thamme Gowda Jul 13 '20 at 03:32
  • It doesn't work for me, it still gives "java.lang.ClassNotFoundException: org.postgresql.Driver". Any idea? – SudipM Nov 13 '20 at 12:18
  • For me it gives the following error: `: org.postgresql.util.PSQLException: Connection to localhost:5432 refused. Check that the hostname and port are correct and that the postmaster is accepting TCP/IP connections.` My Postgres DB has a name, and I don't see an option to include that in the syntax. However, when I use the `psycopg2` library it works. Is there any way I can make Spark connect directly? – thentangler Sep 28 '21 at 13:52
  • Can the url, username and password be added to the config of the SparkSession so that Spark can let me just do `spark.sql('select * from db.tblname limit 10')`? – sAguinaga Sep 09 '22 at 17:36
  • The url changed to https://jdbc.postgresql.org/download/, could you please update it? – exyi Sep 28 '22 at 16:35

The following worked for me with postgres on localhost:

Download the PostgreSQL JDBC Driver from https://jdbc.postgresql.org/download.html.

For the pyspark shell you use the SPARK_CLASSPATH environment variable:

$ export SPARK_CLASSPATH=/path/to/downloaded/jar
$ pyspark

For submitting a script via spark-submit use the --driver-class-path flag:

$ spark-submit --driver-class-path /path/to/downloaded/jar script.py

In the Python script, load the table as a DataFrame as follows:

from pyspark.sql import DataFrameReader

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = DataFrameReader(sqlContext).jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)

or alternatively:

df = sqlContext.read.format('jdbc').\
    options(url='jdbc:%s' % url, dbtable='tablename').\
    load()

Note that when submitting the script via spark-submit, you need to define the sqlContext.
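
For example, a minimal sketch of defining it yourself at the top of the script (the app name, database name, and credentials are placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Create the contexts explicitly; unlike the pyspark shell, spark-submit does not do this for you
sc = SparkContext(appName="postgres-example")
sqlContext = SQLContext(sc)

url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = sqlContext.read.jdbc(
    url='jdbc:%s' % url, table='tablename', properties=properties
)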

mdh

It is necessary to copy postgresql-42.1.4.jar to all nodes; in my case, I copied it to /opt/spark-2.2.0-bin-hadoop2.7/jars.

Also, I set the classpath in ~/.bashrc (export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"),

and it works fine in the pyspark console and Jupyter.
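
For example, a minimal sketch of those two steps (the paths match the ones above and will differ on your system):

cp postgresql-42.1.4.jar /opt/spark-2.2.0-bin-hadoop2.7/jars/
echo 'export SPARK_CLASSPATH="/opt/spark-2.2.0-bin-hadoop2.7/jars"' >> ~/.bashrc
source ~/.bashrc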

Chevelle
  • I'm looking for how to do this in Amazon EMR. It used to work for me just setting spark.jar=/path/to/driver, but after EMR 5.2.1 (which was Spark version 2.0.2) I can't get it to work anymore. – Evan Zamir Mar 13 '18 at 18:18

To use pyspark with a Jupyter notebook: first open pyspark with

pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar  --jars /spark_drivers/postgresql-42.2.12.jar

Then in the Jupyter notebook:

import os
from pyspark.sql import SparkSession

# Expand '~' here: the JVM does not expand it in spark.driver.extraClassPath
jardrv = os.path.expanduser("~/spark_drivers/postgresql-42.2.12.jar")

spark = SparkSession.builder.config('spark.driver.extraClassPath', jardrv).getOrCreate()
url = 'jdbc:postgresql://127.0.0.1/dbname'
properties = {'user': 'usr', 'password': 'pswd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)
Galuoises
  • FWIW, I was not able to get any of the other solutions to work without specifying --driver-class-path and --jars directly to pyspark. This is the correct answer. – Aki Nov 06 '20 at 15:01
  • I could not get it to work in a Zeppelin notebook on AWS, but it worked when I ssh'ed into the master EC2 node and used pyspark --driver-class-path /spark_drivers/postgresql-42.2.12.jar --jars /spark_drivers/postgresql-42.2.12.jar. Thanks – Robin Jun 15 '21 at 19:09

You normally need either:

  1. to install the Postgres driver on your cluster,
  2. to provide the Postgres driver jar from your client with the --jars option,
  3. or to provide the Maven coordinates of the Postgres driver with the --packages option.

If you detail how you are launching pyspark, we may be able to give you more details.
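
For example, options 2 and 3 would look roughly like this when launching the pyspark shell (the jar path and driver version are placeholders):

pyspark --jars /path/to/postgresql-42.2.5.jar
pyspark --packages org.postgresql:postgresql:42.2.5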

Some clues/ideas:

spark-cannot-find-the-postgres-jdbc-driver

Not able to connect to postgres using jdbc in pyspark shell

MiguelPeralvo
  • What if I am trying to run it locally? Do I need to download the postgres driver? And where should I store it? – Mpizos Dimitris Jan 22 '16 at 15:31
  • Yes, you need it. Because you're acting as a Postgres client, and you're specifying that you're going to use the Postgres driver in the `driver = "org.postgresql.Driver"` option. You can store it wherever you want on your local computer (for example jre\lib\ext of your Java installation) and specify the storage path in the CLASSPATH. – MiguelPeralvo Jan 22 '16 at 15:42

One approach, building on the example in the quick start guide, is this blog post, which shows how to add the --packages org.postgresql:postgresql:9.4.1211 argument to the spark-submit command.

This downloads the driver into the ~/.ivy2/jars directory, in my case /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar. Passing this as the --driver-class-path option gives the full spark-submit command of:

/usr/local/Cellar/apache-spark/2.0.2/bin/spark-submit\
 --packages org.postgresql:postgresql:9.4.1211\
 --driver-class-path /Users/derekhill/.ivy2/jars/org.postgresql_postgresql-9.4.1211.jar\
 --master local[4] main.py

And in main.py:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dataframe = spark.read.format('jdbc').options(
        url = "jdbc:postgresql://localhost/my_db?user=derekhill&password=''",
        database='my_db',
        dbtable='my_table'
    ).load()

dataframe.show()
Derek Hill

I had trouble getting a connection to the Postgres DB with the jars I had on my computer. This code solved my problem with the driver:

from pyspark.sql import SparkSession
import os

# PYSPARK_SUBMIT_ARGS must be set before the SparkSession is created so that
# Spark fetches the Postgres driver via --packages
sparkClassPath = os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.postgresql:postgresql:42.1.1 pyspark-shell'

spark = SparkSession \
    .builder \
    .config("spark.driver.extraClassPath", sparkClassPath) \
    .getOrCreate()

df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/yourDBname") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "yourtablename") \
    .option("user", "postgres") \
    .option("password", "***") \
    .load()

df.show()
Florian Uhlmann
  • Many many thanks! I was using the config approach too and it took a good while to figure out that `.option("driver", "org.postgresql.Driver")` is mandatory for that approach. – RndmSymbl Oct 02 '21 at 20:05

I also got this error

java.sql.SQLException: No suitable driver
 at java.sql.DriverManager.getDriver(Unknown Source)

and adding one item, .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar'), to the SparkSession builder worked.

e.g.:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('Python Spark Postgresql') \
    .config("spark.jars", "./postgresql-42.2.18.jar") \
    .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') \
    .getOrCreate()


df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/abc") \
    .option("dbtable", 'tablename') \
    .option("user", "postgres") \
    .option("password", "1") \
    .load()

df.printSchema()
Finch Xu
  • I tried using .config('spark.driver.extraClassPath', './postgresql-42.2.18.jar') and it still did not work. – Robin Jun 15 '21 at 18:02

This exception means the JDBC driver is not on the driver classpath. You can pass the JDBC jar to spark-submit with the --jars parameter, and also add it to the driver classpath using spark.driver.extraClassPath.
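
For example, a minimal sketch of such a spark-submit invocation (the jar path and script name are placeholders):

spark-submit \
  --jars /path/to/postgresql-42.2.5.jar \
  --conf spark.driver.extraClassPath=/path/to/postgresql-42.2.5.jar \
  your_script.py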

Shawn Guo
  1. Download the PostgreSQL JDBC jar (e.g. from https://jdbc.postgresql.org/download/).
  2. Add it to your Spark installation's jars/ folder.
  3. Restart your kernel. It should work.
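
For example, assuming SPARK_HOME points at your Spark installation (the jar version and download path are placeholders):

cp ~/Downloads/postgresql-42.2.5.jar "$SPARK_HOME/jars/"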
Vishal R

Just initialize pyspark with --jars <path/to/your/jdbc.jar>

E.g.: pyspark --jars /path/Downloads/postgresql-42.2.16.jar

Then create a DataFrame as suggested in the other answers above.

E.g.:

df2 = spark.read.format("jdbc") \
    .option("url", "jdbc:postgresql://localhost:5432/db") \
    .option("dbtable", "yourTableHere") \
    .option("user", "postgres") \
    .option("password", "postgres") \
    .option("driver", "org.postgresql.Driver") \
    .load()
musiceni

Download postgres JDBC driver from https://jdbc.postgresql.org/download.html and use the script below. Changes to make:

  1. Edit PATH_TO_JAR_FILE
  2. Save your DB credentials in an environment file and load them
  3. Query the DB using the query option and set the fetch size with the fetchsize option

import os
from pyspark.sql import SparkSession

PATH_TO_JAR_FILE = "/home/user/Downloads/postgresql-42.3.3.jar"
spark = SparkSession \
    .builder \
    .appName("Example") \
    .config("spark.jars", PATH_TO_JAR_FILE) \
    .getOrCreate()

DB_HOST = os.environ.get("PG_HOST")
DB_PORT = os.environ.get("PG_PORT")
DB_NAME = os.environ.get("PG_DB_CLEAN")
DB_PASSWORD = os.environ.get("PG_PASSWORD")
DB_USER = os.environ.get("PG_USERNAME")

df = spark.read \
    .format("jdbc") \
    .option("url", f"jdbc:postgresql://{DB_HOST}:{DB_PORT}/{DB_NAME}") \
    .option("user", DB_USER) \
    .option("password", DB_PASSWORD) \
    .option("driver", "org.postgresql.Driver") \
    .option("query","select * from your_table") \
    .option('fetchsize',"1000") \
    .load()

df.printSchema()