
I currently have the following Python code to read in a table from a local SQL Server db into Pandas:

import pandas as pd
import pyodbc

# Connect to DB
server = 'server'
db = 'db'

conn = pyodbc.connect('DRIVER={SQL SERVER}; SERVER=' + server + '; DATABASE=' + db + '; TRUSTED_CONNECTION=yes')
cursor = conn.cursor()

table = 'table'
df = pd.read_sql('Select * From ' + table, conn)

That code works, but now I would like to do the same thing in PySpark. What is the equivalent of this code in PySpark?

I have tried the following:

import findspark
import os
from pyspark.sql import SparkSession, SQLContext
from pyspark.sql.functions import *

# didn't know which of these would work so tried both
os.environ['SPARK_CLASSPATH'] = 'path/to/sqljdbc42.jar'
os.environ['driver-class-path'] = 'path/to/sqljdbc42.jar'

findspark.init('C:/spark/spark')

spark = SparkSession \
        .builder \
        .appName("SparkCoreTest") \
        .getOrCreate()

sc = spark.sparkContext
sqlctx = SQLContext(sc)

server = 'server'
db = 'db'
url = 'jdbc:sqlserver//' + server + ';databaseName=' + db
table = 'table'
properties = {'driver' : 'com.microsoft.sqlserver.jdbc.SQLServerDriver'}

df = sqlctx.read.format('jdbc').options(url=url, dbtable=table, driver='{SQL SERVER}').load()

This gives java.lang.ClassNotFoundException: {SQL SERVER}. Throughout this process I've also gotten errors about not being able to find a "suitable driver," although I think I've fixed those by changing os.environ. Any help would be greatly appreciated!

tgordon18

1 Answer


You are quite close. The ClassNotFoundException means Spark cannot find a driver class with that name: {SQL SERVER} is an ODBC driver name, not a JDBC class (you want com.microsoft.sqlserver.jdbc.SQLServerDriver, which you already have in properties), and the driver jar itself also has to be on Spark's classpath.

In order to use JDBC drivers for MySQL, SQL Server, etc., the jars have to be in a location that is accessible from both the master and the workers (if you run Spark locally you are fine), and you need to tell Spark where they are. You can do this either with the --jars flag:

spark-submit --jars /path/to/sqljdbc42.jar ...other params... yourscript.py

or, from within your code, via a SparkConf:

from pyspark import SparkConf

# create a configuration instance and set the spark.jars option
conf = SparkConf()
conf.set('spark.jars', '/path/to/sqljdbc42.jar')

# give that configuration to the spark session
spark = SparkSession.builder\
                    .config(conf=conf) \
                    .appName(app_name) \
                    .getOrCreate()

Additionally, you should read from SQL Server through the SparkSession rather than through the SQLContext directly (though that depends on your Spark version):

df = spark.read.format('jdbc').options(url=url, dbtable=table, driver='com.microsoft.sqlserver.jdbc.SQLServerDriver').load()

(The driver jar here is sqljdbc42.jar, Microsoft's JDBC driver for SQL Server.)
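For reference, here is a minimal sketch of how the pieces could fit together (server, database and table names are placeholders, as in your code; note that the JDBC URL needs jdbc:sqlserver:// with a colon, which your attempt was missing):

from pyspark import SparkConf
from pyspark.sql import SparkSession

# make the SQL Server JDBC driver jar available to Spark
conf = SparkConf()
conf.set('spark.jars', '/path/to/sqljdbc42.jar')

spark = SparkSession.builder \
        .config(conf=conf) \
        .appName('SparkCoreTest') \
        .getOrCreate()

server = 'server'
db = 'db'
table = 'table'

# note the '://' after jdbc:sqlserver
url = 'jdbc:sqlserver://' + server + ';databaseName=' + db

df = spark.read.format('jdbc') \
        .option('url', url) \
        .option('dbtable', table) \
        .option('driver', 'com.microsoft.sqlserver.jdbc.SQLServerDriver') \
        .load()

If you also need Windows authentication (the equivalent of TRUSTED_CONNECTION=yes in your pyodbc string), appending ;integratedSecurity=true to the URL should work, but that additionally requires the sqljdbc_auth.dll shipped with the Microsoft driver to be on the Java library path.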

(note: there is also a way to work with environment variables but I have not used it)
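For completeness, if you do want to try the environment-variable route (untested here, as noted), one common pattern with findspark is to set PYSPARK_SUBMIT_ARGS before creating the session, for example:

import os

# must be set before findspark.init() / building the SparkSession;
# the trailing 'pyspark-shell' is required
os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars /path/to/sqljdbc42.jar pyspark-shell'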

Hope this helps, good luck!

mkaran