How to install a postgresql JDBC driver in pyspark

Question

I use PySpark with Spark 2.2.0 on Lubuntu 16.04 and I want to write a DataFrame to my PostgreSQL database. As far as I understand it, I have to install a JDBC driver on the Spark master for this. I downloaded the PostgreSQL JDBC driver from their website and tried to follow this post. I added `spark.jars.packages /path/to/driver/postgresql-42.2.1.jar` to spark-defaults.conf, with the only result that pyspark no longer launches.

I'm kind of lost in Java land. For one, I don't know if this is the right format. The documentation tells me I should add a list, but I don't know what a path list is supposed to look like. I also don't know whether I have to specify `spark.jars` and/or `spark.driver.extraClassPath` as well, or whether `spark.jars.packages` is enough. And if I have to add them, what format do they take?

score 2 · Accepted Answer · answered Feb 23 '18 at 13:01

`spark.jars.packages` is for dependencies that can be pulled from Maven (think of it as pip for Java, although the analogy is probably a loose one).

You can submit your job with the option `--jars /path/to/driver/postgresql-42.2.1.jar`, so that the submission also provides the library, which the cluster manager will distribute to all worker nodes on your behalf.
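
As a rough illustration of the `--jars` option, here is a minimal sketch that assembles the `spark-submit` command line from Python. The helper name `spark_submit_command` and the script name `my_job.py` are hypothetical; only `spark-submit` and `--jars` (which takes a comma-separated list of jars) come from Spark itself.

```python
def spark_submit_command(script, jars):
    """Build the argv for spark-submit, shipping the given jars with the job."""
    # --jars expects one comma-separated string, not repeated flags.
    return ["spark-submit", "--jars", ",".join(jars), script]


# Hypothetical job script; the jar path is the one from the question.
command = spark_submit_command(
    "my_job.py", ["/path/to/driver/postgresql-42.2.1.jar"]
)
print(" ".join(command))
```

The resulting list can be handed to `subprocess.run(command)` if you want to launch the job from a script rather than typing it in a shell.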

If you want to set this as a configuration, you can use the `spark.jars` key instead of `spark.jars.packages`. The latter requires Maven coordinates rather than a path (which is probably the reason why your job is failing).
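
To make the difference between the two keys concrete, here is a minimal sketch. The helpers `driver_conf` and `build_session` are hypothetical names introduced for illustration; the coordinate `org.postgresql:postgresql:42.2.1` is assumed to be the Maven equivalent of the jar from the question. Either setting has to be in place before the Spark session (and its JVM) is created.

```python
def driver_conf(jar_path=None, maven_coordinate=None):
    """Return the Spark config entry that ships a JDBC driver.

    Exactly one of jar_path / maven_coordinate must be given.
    """
    if (jar_path is None) == (maven_coordinate is None):
        raise ValueError("pass exactly one of jar_path or maven_coordinate")
    if jar_path is not None:
        # spark.jars: comma-separated list of jar paths or URLs.
        return {"spark.jars": jar_path}
    # spark.jars.packages: comma-separated groupId:artifactId:version.
    return {"spark.jars.packages": maven_coordinate}


def build_session(conf, app_name="jdbc-example"):
    """Create a SparkSession with the given config entries applied."""
    # Imported lazily so driver_conf can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# A local jar goes under spark.jars ...
local = driver_conf(jar_path="/path/to/driver/postgresql-42.2.1.jar")
# ... while Maven coordinates go under spark.jars.packages.
remote = driver_conf(maven_coordinate="org.postgresql:postgresql:42.2.1")
```

In spark-defaults.conf the same entries would be written as a key followed by its value on one line, for example `spark.jars /path/to/driver/postgresql-42.2.1.jar`.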

You can read more about the configuration keys I mentioned in the official documentation.

stefanobaghino


Do I still have to tell Spark that this is the driver to use? I now run the application with the `--jars /path/to/jar` argument, and in the application environment I see the entry `spark://myip:51810/jars/postgresql-42.2.1.jar Added By User`, but when I run the code I get the `Py4JJavaError: An error occurred while calling o47.jdbc. : java.sql.SQLException: No suitable driver` exception. – Thagor Feb 23 '18 at 14:42

Probably so. Try adding `properties={"driver": 'org.postgresql.Driver'}` or `.option("driver", 'org.postgresql.Driver')` (shamelessly copied, and edited, from https://stackoverflow.com/a/37422017/3314107) – stefanobaghino Feb 23 '18 at 14:51
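
A minimal sketch of the `properties` variant suggested in the comment above, assuming the driver jar has already been made available via `--jars` or `spark.jars`. The helper names, the URL, the table name, and the credentials are placeholders, not values from the thread; `DataFrameWriter.jdbc` is the PySpark call being illustrated.

```python
def jdbc_properties(user, password):
    """Connection properties that name the PostgreSQL driver class explicitly."""
    return {
        "user": user,
        "password": password,
        # Without this entry Spark may fail with "No suitable driver".
        "driver": "org.postgresql.Driver",
    }


def write_to_postgres(df, url, table, properties, mode="append"):
    """Write a Spark DataFrame to a PostgreSQL table over JDBC."""
    df.write.jdbc(url=url, table=table, mode=mode, properties=properties)


# Placeholder credentials for illustration only.
props = jdbc_properties("my_user", "my_password")
# Example call, given an existing DataFrame `df` (URL and table are placeholders):
# write_to_postgres(df, "jdbc:postgresql://localhost:5432/mydb", "my_table", props)
```

The `.option("driver", 'org.postgresql.Driver')` form from the comment sets the same class name when you use `df.write.format("jdbc")` instead.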
