0

I'm trying to load data from MS SQL server using pyspark in Jupyter Notebook. Spark is tested and works fine. I'm using following:

from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local"
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath","mssql-jdbc-7.4.1.jre8.jar")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession

# Loading data from a JDBC source
jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql:dbserver") \
    .option("url", "jdbc:sqlserver://188.188.188.188:10004;databaseName=dbnme") \
    .option("dbtable", "dbo.tablename") \
    .option("user", "usernmame") \
    .option("password", "pawwrod") \
    .load()

My MS SQL driver (mssql-jdbc-7.4.1.jre8.jar) jar is in the same location where my python script is.

Errors I get:

enter image description here

and:

enter image description here

Hrvoje
  • 13,566
  • 7
  • 90
  • 104

2 Answers2

0

EDIT:

Please check this link : Characters that are not allowed in table name & column name in sql server ?

The first character must be one of the following:

  • A letter as defined by the Unicode Standard 3.2. The Unicode definition of letters includes Latin characters from a through z, from A through Z, and also letter characters from other languages.

  • The underscore (_), at sign (@), or number sign (#).

Subsequent characters can include the following:

  • Letters as defined in the Unicode Standard 3.2.

  • Decimal numbers from either Basic Latin or other national scripts.

  • The at sign, dollar sign ($), number sign, or underscore.

Please try using this function I made for connecting to SQL databases (it has the driver as a part of the connection properties, and it downloads the driver at runtime) :

def connect_to_sql(
    spark, jdbc_hostname, jdbc_port, database, data_table, username, password
):
    jdbc_url = "jdbc:sqlserver://{0}:{1}/{2}".format(jdbc_hostname, jdbc_port, database)

    connection_details = {
        "user": username,
        "password": password,
        "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver",
    }

    df = spark.read.jdbc(url=jdbc_url, table=data_table, properties=connection_details)
    return df

Just pass the necessary arguments, and it should get working. Let me know if it doesn't, I'll make necesary changes.

pissall
  • 7,109
  • 2
  • 25
  • 45
0

I use Apache Spark 2.4.4 with Hadoop 2.7 and above. Here is the code that worked in for me in the end:

from pyspark import SparkContext, SparkConf, SQLContext

appName = "PySpark SQL Server Example - via JDBC"
master = "local"
conf = SparkConf() \
    .setAppName(appName) \
    .setMaster(master) \
    .set("spark.driver.extraClassPath","mssql-jdbc-7.4.1.jre8.jar")
sc = SparkContext.getOrCreate(conf=conf)
sqlContext = SQLContext(sc)
spark = sqlContext.sparkSession

hostname = "localhost"
database = "HumanResources"
port = "1433"
table = "dbo.Employee"
user = "sa"
password  = "Dedo9090"

jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://ILI-LAB-HRVOJE;databaseName={database}") \
    .option("dbtable", table) \
    .option("user", user) \
    .option("password", password) \
    .load()

jdbcDF.head(50)

If you still have problem accessing SQL server see that TCP/IP is enabled as suggested here and also make sure your firewall is not blocking access to 1433 port on which MS SQL server is listening. It was not matter of unsupported chars in password in the end.

Hrvoje
  • 13,566
  • 7
  • 90
  • 104