
I have asked this question previously as well but did not get any answer (Not able to connect to postgres using jdbc in pyspark shell).

I have successfully installed Spark 1.3.0 on my local Windows machine and ran sample programs to test it using the pyspark shell.

Now, I want to run Correlations from MLlib on the data that is stored in PostgreSQL, but I am not able to connect to PostgreSQL.

I have successfully added the required jar (tested this jar) to the classpath by running:

pyspark --jars "C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar"

I can see that the jar is successfully added in the Environment UI.

When I run the following in the pyspark shell:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")  

I get this error:

>>> df = sqlContext.load(source="jdbc",url="jdbc:postgresql://[host]/[dbname]", dbtable="[schema.table]")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\pyspark\sql\context.py", line 482, in load
    df = self._ssql_ctx.load(source, joptions)
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "C:\Users\ACERNEW3\Desktop\Spark\spark-1.3.0-bin-hadoop2.4\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o20.load.
: java.sql.SQLException: No suitable driver found for jdbc:postgresql://[host]/[dbname]
        at java.sql.DriverManager.getConnection(DriverManager.java:602)
        at java.sql.DriverManager.getConnection(DriverManager.java:207)
        at org.apache.spark.sql.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:94)
        at org.apache.spark.sql.jdbc.JDBCRelation.<init>(JDBCRelation.scala:125)
        at org.apache.spark.sql.jdbc.DefaultSource.createRelation(JDBCRelation.scala:114)
        at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:290)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:679)
        at org.apache.spark.sql.SQLContext.load(SQLContext.scala:667)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:619)
Soni Shashank
  • possible duplicate of [Not able to connect to postgres using jdbc in pyspark shell](http://stackoverflow.com/questions/29669420/not-able-to-connect-to-postgres-using-jdbc-in-pyspark-shell) – Mark Rotteveel Apr 23 '15 at 11:38
  • definitely duplicate! it's been posted by the same user! – eliasah Apr 23 '15 at 11:49
  • Yes, it is a duplicate; this one is the more detailed version. Please help me out here. – Soni Shashank Apr 23 '15 at 12:00
  • You shouldn't post duplicates. If anything you should update the older version to include the additional information. – Mark Rotteveel Apr 23 '15 at 12:25
  • Okay, noted; I will not do this again... But apart from downvoting and reminding me that I should not do this, can anyone help me out here? – Soni Shashank Apr 23 '15 at 12:28
  • One suggestion: use a newer driver (specifically JDBC 4 or 4.1) maybe the problem is simply that the driver hasn't been loaded (driver autoloading was added in JDBC 4). Another method would be to explicitly load the class (using `Class.forName`) – Mark Rotteveel Apr 23 '15 at 13:57
  • Thanks Mark for the suggestion. I tried JDBC 3, 4 & 4.1 drivers one by one but nothing is working... Can you please help me with how to load the class explicitly in Python? (one possible way is sketched after this comment thread) – Soni Shashank Apr 24 '15 at 10:47
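
For reference, one way to attempt the explicit class load that Mark describes is through the py4j JVM gateway in the pyspark shell. This is only a hedged sketch (it is not from the original thread), and it can only succeed if the driver jar is actually visible to the driver JVM (e.g. passed via --driver-class-path):

# Ask the JVM to load the PostgreSQL JDBC driver class explicitly.
# This raises a Py4JJavaError (wrapping ClassNotFoundException) if the
# jar is not on the driver's classpath.
sc._jvm.java.lang.Class.forName("org.postgresql.Driver")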

6 Answers


I had this exact problem with MySQL/MariaDB, and got a big clue from this question.

So your pyspark command should be:

pyspark --conf spark.executor.extraClassPath=<jdbc.jar> --driver-class-path <jdbc.jar> --jars <jdbc.jar> --master <master-URL>

Also watch for errors when pyspark starts, such as "Warning: Local jar ... does not exist, skipping." and "ERROR SparkContext: Jar not found at ..."; these probably mean you spelled the path wrong.
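
As an illustration only, using the jar path from the question and a local master (adjust both to your environment), that command might look like:

pyspark --conf spark.executor.extraClassPath=C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar --driver-class-path C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar --jars C:\path\to\jar\postgresql-9.2-1002.jdbc3.jar --master local[*]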

8forty
  • Hey, can you please explain the various flags? I am new to Spark and struggling a lot to get it to work with a Phoenix table. Does master mean the host address of Phoenix? Where can I find jdbc.jar? – pg2455 Sep 29 '15 at 17:33
  • There's comprehensive documentation of all flags in the Spark docs. The current version's "Configuration" page (documenting these flags and many more) is here: http://spark.apache.org/docs/latest/configuration.html – 8forty Sep 30 '15 at 19:42
  • It works! Only --driver-class-path is needed in Spark 1.6.1 – giaosudau Oct 05 '16 at 10:24
  • You need --jars if running on a cluster, so that every node gets the jdbc jar (unless it was installed manually on every host). If you are running spark on a single node, then you don't need --jars – Burrito Dec 06 '19 at 20:17

A slightly more elegant solution:

import java.util.Properties

val props = new Properties()
props.put("driver", "org.postgresql.Driver")  // register the JDBC driver class
sqlContext.read.jdbc("jdbc:postgresql://[host]/[dbname]", "[schema.table]", props)
jake256

As jake256 suggested, the

"driver", "org.postgresql.Driver"

key-value pair was missing. In my case, I launched pyspark as:

pyspark --jars /path/to/postgresql-9.4.1210.jar

with the following instructions:

  from pyspark.sql import DataFrameReader

  url = 'postgresql://192.168.2.4:5432/postgres'
  properties = {'user': 'myUser', 'password': 'myPasswd', 'driver': 'org.postgresql.Driver'}
  df = DataFrameReader(sqlContext).jdbc(
      url='jdbc:%s' % url, table='weather', properties=properties
  )
  df.show()

  +-------------+-------+-------+-----------+----------+
  |         city|temp_lo|temp_hi|       prcp|      date|
  +-------------+-------+-------+-----------+----------+
  |San Francisco|     46|     50|       0.25|1994-11-27|
  |San Francisco|     43|     57|        0.0|1994-11-29|
  |      Hayward|     54|     37|0.239999995|1994-11-29|
  +-------------+-------+-------+-----------+----------+

Tested on:

  • Ubuntu 16.04

  • PostgreSQL server version 9.5

  • PostgreSQL JDBC driver postgresql-9.4.1210.jar

  • Spark spark-2.0.0-bin-hadoop2.6 (I am also confident that it should work on spark-2.0.0-bin-hadoop2.7)

  • Java JDK 1.8, 64-bit

Other JDBC drivers can be found at: https://www.petefreitag.com/articles/jdbc_urls/

The tutorial I followed is at: https://developer.ibm.com/clouddataservices/2015/08/19/speed-your-sql-queries-with-spark-sql/

A similar solution was also suggested in: pyspark mysql jdbc load An error occurred while calling o23.load No suitable driver

aks

This error seems to get thrown when you use the wrong version of the JDBC driver. Check https://jdbc.postgresql.org/download.html to make sure that you have the right one.

Note in particular:

JDK 1.1 - JDBC 1. Note that with the 8.0 release JDBC 1 support has been removed, so look to update your JDK when you update your server.

JDK 1.2, 1.3 - JDBC 2.

JDK 1.3 + J2EE - JDBC 2 EE. This contains additional support for javax.sql classes.

JDK 1.4, 1.5 - JDBC 3. This contains support for SSL and javax.sql, but does not require J2EE as it has been added to the J2SE release.

JDK 1.6 - JDBC4. Support for JDBC4 methods is not complete, but the majority of methods are implemented.

JDK 1.7, 1.8 - JDBC41. Support for JDBC4 methods is not complete, but the majority of methods are implemented.

  • No, this error is returned when there is no driver loaded that accepts the JDBC url on the classpath, as the JDBC url is fine, it means that the driver hasn't been loaded. A JDBC 2 or 3 driver will - usually - load fine under newer Java versions, it simply means newer functionality is not available. One of the few exceptions I know is the SQL Server driver which will refuse to load if you have the wrong version for your JVM version. Note however that using a newer JDBC version might fix the problem, but that would simply be due to driver autoloading support in JDBC 4. – Mark Rotteveel Apr 23 '15 at 13:54

See this post, please: just place your script after all the options.
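
In other words, the options must come before the script name; anything placed after the script is passed as an argument to the script itself rather than to spark-submit/pyspark. As an illustration (the script name is hypothetical):

spark-submit --driver-class-path /path/to/postgresql-9.4.1210.jar --jars /path/to/postgresql-9.4.1210.jar my_script.py

whereas putting --jars after my_script.py would leave the driver jar off Spark's classpath.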

zhaozhi

That's pretty straightforward. To connect to an external database and retrieve data into Spark dataframes, an additional jar file is required. E.g. with MySQL, the JDBC driver is required. Download the driver package and extract mysql-connector-java-x.yy.zz-bin.jar to a path that is accessible from every node in the cluster. Preferably, this is a path on a shared file system. E.g. with a Pouta Virtual Cluster, such a path would be under /shared_data; here I use /shared_data/thirdparty_jars/.

With direct Spark job submissions from the terminal, one can specify the --driver-class-path argument pointing to extra jars that should be provided to workers with the job. However, this does not work with this approach, so we must configure these paths for the front-end and worker nodes in the spark-defaults.conf file, usually found in the /opt/spark/conf directory.

spark.driver.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar
spark.executor.extraClassPath /"your-path"/mysql-connector-java-5.1.35-bin.jar
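
Once those classpath entries are in place, a JDBC read along the following lines should be able to find the driver. This is only a minimal sketch with hypothetical connection details (host, database, table, credentials), and it assumes Spark 1.4+ where sqlContext.read is available:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
# Hypothetical host, database, table and credentials; the driver class
# matches the mysql-connector-java jar configured on the classpath above.
df = sqlContext.read.jdbc(
    url='jdbc:mysql://db-host:3306/mydb',
    table='mytable',
    properties={'user': 'myUser', 'password': 'myPasswd',
                'driver': 'com.mysql.jdbc.Driver'}
)
df.show()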