Unable to connect to MS SQL from Apache Spark using pyspark on Jupyter notebook

Question

I'm trying to load data from MS SQL server using pyspark in Jupyter Notebook. Spark is tested and works fine. I'm using following:

from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local"
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath","mssql-jdbc-7.4.1.jre8.jar")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession

# Loading data from a JDBC source
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("url", "jdbc:sqlserver://188.188.188.188:10004;databaseName=dbnme") \
    .option("dbtable", "dbo.tablename") \
    .option("user", "usernmame") \
    .option("password", "pawwrod") \
    .load()

My MS SQL driver (mssql-jdbc-7.4.1.jre8.jar) jar is in the same location where my python script is.

Errors I get:

and:

Please change the characters of the blacked out strings, and paste what is the `option`. If there are any special characters etc in it — pissall, Oct 18 '19 at 11:32

pissall · Answer 1 · 2019-10-18T11:40:06.907

EDIT:

Please check this link : Characters that are not allowed in table name & column name in sql server ?

The first character must be one of the following:

A letter as defined by the Unicode Standard 3.2. The Unicode definition of letters includes Latin characters from a through z, from A through Z, and also letter characters from other languages.
The underscore (_), at sign (@), or number sign (#).

Subsequent characters can include the following:

Letters as defined in the Unicode Standard 3.2.
Decimal numbers from either Basic Latin or other national scripts.
The at sign, dollar sign ($), number sign, or underscore.

Please try using this function I made for connecting to SQL databases (it has the driver as a part of the connection properties, and it downloads the driver at runtime) :

def connect_to_sql(
    spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    jdbc_url = "jdbc:sqlserver://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)

    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df

Just pass the necessary arguments, and it should get working. Let me know if it doesn't, I'll make necesary changes.

I'll try your function now. "_" is not in column name it's just in password. — Hrvoje, Oct 18 '19 at 11:38

score 0 · Accepted Answer · answered Oct 22 '19 at 00:56

I use Apache Spark 2.4.4 with Hadoop 2.7 and above. Here is the code that worked in for me in the end:

from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local"
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath","mssql-jdbc-7.4.1.jre8.jar")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession

hostname = "localhost"
database = "HumanResources"
port = "1433"
table = "dbo.Employee"
user = "sa"
password  = "Dedo9090"

jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://ILI-LAB-HRVOJE;databaseName={database}") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load()

jdbcDF.head(50)

If you still have problem accessing SQL server see that TCP/IP is enabled as suggested here and also make sure your firewall is not blocking access to 1433 port on which MS SQL server is listening. It was not matter of unsupported chars in password in the end.

Unable to connect to MS SQL from Apache Spark using pyspark on Jupyter notebook

2 Answers2