
I am using PySpark on Spark 1.5 on Cloudera YARN, with Python 3.3 on CentOS 6 machines. The SQL Server instance is SQL Server Enterprise 64-bit. The SQL Server JDBC driver is sqljdbc4.jar, and I have added it to my .bashrc:

export SPARK_CLASSPATH="/var/lib/spark/sqljdbc4.jar"
export PYSPARK_SUBMIT_ARGS="--conf spark.executor.extraClassPath=/var/lib/spark/sqljdbc4.jar --driver-class-path=/var/lib/spark/sqljdbc4.jar --jars=/var/lib/spark/sqljdbc4.jar --master yarn --deploy-mode client"

And I can see confirmation when I launch Spark that

SPARK_CLASSPATH was detected (set to '/var/lib/spark/sqljdbc4.jar')
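
(As a quick sanity check, a minimal sketch assuming the interactive PySpark shell, where sc is the SparkContext: the driver class can be looked up through py4j to confirm the jar is visible to the driver JVM.)

# Hedged check: ask the driver JVM to load the JDBC driver class via py4j.
# This raises a Py4JJavaError wrapping ClassNotFoundException if
# sqljdbc4.jar is not actually on the driver classpath.
sc._jvm.java.lang.Class.forName("com.microsoft.sqlserver.jdbc.SQLServerDriver")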

I have a DataFrame with the following schema:

root
 |-- daytetime: timestamp (nullable = true)
 |-- ip: string (nullable = true)
 |-- tech: string (nullable = true)
 |-- th: string (nullable = true)
 |-- car: string (nullable = true)
 |-- min_dayte: timestamp (nullable = true)
 |-- max_dayte: timestamp (nullable = true)

I have already created an empty table in my MS SQL Server database called 'dbo.shaping', where the 3 timestamp columns are datetime2(7) and the others nvarchar(50).

I try to export the DataFrame from PySpark using this:

properties = {"user": "<username>", "password": "<password>"} 

df.write.format('jdbc').options(
    url='<IP>:1433/<dbname>',
    dbtable='dbo.shaping',
    driver="com.microsoft.sqlserver.jdbc.SQLServerDriver",
    properties=properties)

I get the following traceback:

Py4JError: An error occurred while calling o250.option. Trace:
py4j.Py4JException: Method option([class java.lang.String, class java.util.HashMap]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:342)
at py4j.Gateway.invoke(Gateway.java:252)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:207)
at java.lang.Thread.run(Thread.java:744)

Is my approach at least correct? Could this error be related to writing this specific type of data, i.e. an issue with the data construct rather than with my code?

PR102012
  • You're resurrecting a question that's over a year old. Have you verified it's still relevant as asked (in the face of things like newer versions of software)? – Jeroen Mostert Jul 05 '17 at 21:52
  • Software updates are not possible on this infra. Must be pyspark 1.5 solution. – PR102012 Jul 06 '17 at 12:48
  • pyspark 1.5 is one thing, but the Microsoft JDBC driver for SQL Server has also undergone updates. Your error has all the hallmarks of a version mismatch between components, it's just not quite clear which ones. I recommend explicitly listing the version numbers of everything you're using (python, pyspark, JDBC driver, SQL Server, OS), otherwise there's little hope of someone reproducing it. (Which is also why I'm skeptical that this is "widely applicable to a large audience", but I have no experience with pyspark.) – Jeroen Mostert Jul 06 '17 at 12:51

1 Answer


You cannot use a dict as a value for options. The options method expects only str arguments (see the Scala docs and the PySpark annotations) and is expanded into separate calls to the Java option method.

In current Spark versions the value is automatically converted to a string, so your code would fail silently, but that is not the case in 1.5.
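
To make the failure concrete, here is a minimal sketch (placeholders carried over from the question) of how the options call expands: each value crosses the py4j bridge as an individual option(key, value) call, and only strings match.

# Equivalent chain of .option() calls; each value must be a str on Spark 1.5.
writer = df.write.format('jdbc')
writer = writer.option('url', '<IP>:1433/<dbname>')   # str matches option(String, String)
writer = writer.option('dbtable', 'dbo.shaping')      # str: fine
# A dict is converted by py4j to a java.util.HashMap, and there is no
# option(String, HashMap) overload; this is the Py4JException in the question:
# writer = writer.option('properties', {'user': '<username>', 'password': '<password>'})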

Since the properties are specific to the JDBC driver anyway, you should use the jdbc method:

properties = {
    "user": "<username>",
    "password": "<password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}

df.write.jdbc(
    url='<IP>:1433/<dbname>',
    table='dbo.shaping',
    properties=properties)

though unpacking properties should work as well:

df.write.format('jdbc').options(
    url='<IP>:1433/<dbname>',
    dbtable='dbo.shaping',
    **properties).save()

(driver is not passed as an explicit keyword here, because properties as defined above already contains it; supplying it twice would raise a TypeError for a duplicate keyword argument.)
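
One caveat on the url value, which both snippets copy from the question: with the Microsoft driver the JDBC URL normally carries a jdbc:sqlserver:// prefix and a databaseName property. A sketch of the standard form (<IP> and <dbname> remain the question's placeholders):

url = 'jdbc:sqlserver://<IP>:1433;databaseName=<dbname>'

# mode='append' because dbo.shaping already exists on the server.
df.write.jdbc(url=url, table='dbo.shaping', mode='append', properties=properties)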

In general, when you see:

py4j.Py4JException: Method ... does not exist

it usually signals a mismatch between the local Python types and the types expected by the JVM method in use.

See also: How to use JDBC source to write and read data in (Py)Spark?

zero323
  • I'm including the properties for 'user', 'password' and 'driver', as you have here. However, I am now receiving the error 'Py4JJavaError: An error occurred while calling o230.jdbc. : java.sql.SQLException: No suitable driver found for ...'. Is it possible, because I am on YARN, that the .jar driver file included in my .bashrc on my management/execution node is not in the same directory on every other non-master node? Thus, when I use multiple nodes, some don't have the jar? – PR102012 Jul 09 '17 at 18:03
  • The JDBC driver has to be present on every worker node. Personally I'd use the `--packages` option, though in client mode you should be able to pass a local jar with `--jars`. – zero323 Jul 10 '17 at 09:38
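
A hedged sketch of that last suggestion, mirroring the question's .bashrc approach (the path is carried over from the question): unlike spark.executor.extraClassPath, which assumes the jar already sits at the same path on every node, --jars ships the listed local jar to the executors.

# .bashrc: let spark-submit distribute the local jar instead of assuming
# it is pre-installed on every worker node.
export PYSPARK_SUBMIT_ARGS="--jars /var/lib/spark/sqljdbc4.jar --driver-class-path /var/lib/spark/sqljdbc4.jar --master yarn --deploy-mode client"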