3

I want to ask how can I connect the SQL Server using Windows Authentication, with pyspark library? I can connect with Microsoft SQL Server Management Studio but not when I try to code in Python with a spark.Here's what I tried so far.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.driver.extraClassPath","mssql-jdbc-6.4.0.jre8.jar") \
    .getOrCreate()

mssql_df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://localhost:1433;databaseName=DATABASE-NAME") \
    .option("dbtable", "database-table-name") \
    .option("user", "Windows-Username") \
    .option("password", "Windows-Pass")\
    .option("driver", 'com.mysql.jdbc.Driver').load()

mssql_df.printSchema()
mssql_df.show()
Renos Bardhis
  • 325
  • 2
  • 5
  • 11

1 Answers1

3

As shown here you can set the integratedSecurity=true to connect to SQL Server via jdbc and Windows Authentication.

Then Spark configuration it should look as next:

mssql_df = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://localhost:1433;databaseName=DATABASE-NAME;integratedSecurity=true") \
    .option("dbtable", "database-table-name") \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()

UPDATE:

As discussed in the comments user should place sqljdbc_auth.dll in the same folder where mssql-jdbc-7.4.1.jre12.jar lives or just set spark.driver.extraClassPath for both jars seperated by : as shown below:

.config("spark.driver.extraClassPath","/path.to/mssql-jdbc-6.4.0.jre8.jar:/path/to/sqljdbc_auth.dll")

sqljdbc_auth.dll is part of the Microsoft JDBC Driver 6.0 for SQL Server and you can download it from here. Alternatively you can just install JDBC driver on your system and specify the path where the dll is stored.

abiratsis
  • 7,051
  • 3
  • 28
  • 46
  • Now I have another problem. The problem now is located on the SparkSession. On the .config("spark.driver.extraClassPath", "mssql-jdbc-7.4.1.jre12.jar"). Rise an issue: FileNotFoundError: [WinError 2] The system cannot find the file specified How can I fix this? – Renos Bardhis Aug 23 '19 at 11:29
  • where is the mssql-jdbc-6.4.0.jre8.jar located? you should specify the absolute path for this .jar explicitly – abiratsis Aug 23 '19 at 11:41
  • an alternative would be to move the mssql-jdbc-6.4.0.jre8.jar file under the jars folder which is the default location for all the jars in my case the path is: `C:\spark-2.4.3-bin-hadoop2.7\jars` – abiratsis Aug 23 '19 at 11:55
  • I moved the jar file to my spark folder as you said and also I gave the absolute path but still the same problem I have. There is another configuration that I can use? – Renos Bardhis Aug 23 '19 at 12:26
  • ok so where are you running your spark job? also how the line `.config("spark.driver.extraClassPath","mssql-jdbc-6.4.0.jre8.jar")` looks like at the moment? – abiratsis Aug 23 '19 at 13:40
  • I run it from Pycharm IDE (I do not know if answer well your question). My current line is now `.config("spark.driver.extraClassPath", "C:\\Users\\renos.bardis\\Desktop\\mssql-jdbc-7.4.1.jre12.jar")` Also, I have this problem sometimes `py4j.protocol.Py4JJavaError: An error occurred while calling o37.load.` How can I vanish all the WARNING like `WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/C:/Spark/Spark/spark-2.4.3-bin-hadoop2.7/jars/spark-unsafe_2.11-2.4.3.jar) to method java.nio.Bits.unaligned()` – Renos Bardhis Aug 23 '19 at 14:23
  • ok the .jar is indeed in your desktop? is the previous path correct? in this case I would really suggest to place it elsewhere ie under the jars folder of Spark – abiratsis Aug 23 '19 at 14:26
  • After some modifications, I do not have the problem for the file path but with: `py4j.protocol.Py4JJavaError: An error occurred while calling o37.load. : com.microsoft.sqlserver.jdbc.SQLServerException: This driver is not configured for integrated authentication. ClientConnectionId:26acbe77-7e4b-4c4d-8a05-37786ae7f335 at com.microsoft.sqlserver.jdbc.SQLServerConnection.terminate(SQLServerConnection.java:2924) at com.microsoft.sqlserver.jdbc.AuthenticationJNI.(AuthenticationJNI.java:73)` – Renos Bardhis Aug 23 '19 at 14:30
  • ok so the issue is discussed [here](https://stackoverflow.com/questions/48169801/sqlexception-this-driver-is-not-configured-for-integrated-authentication-tomcat) you need to specify the path for sqljdbc_auth.dll. You can search if it is present already in your system or download it from [here](https://www.microsoft.com/en-us/download/details.aspx?id=11774). In both cases you will need it to copy sqljdbc_auth.dll into the folder where `mssql-jdbc-7.4.1.jre12.jar` is – abiratsis Aug 23 '19 at 14:45
  • The problem fixed. Thank you @AlexandrosBiratsis – Renos Bardhis Aug 23 '19 at 15:16
  • I followed all your instructions but I have a problem when I use `df.count()` the program shows `pyspark.sql.utils.IllegalArgumentException: 'Unsupported class file major version 56'`. How can I fix this (My java version is 1.8.0_221) – Renos Bardhis Aug 28 '19 at 14:42
  • when I put on Pycharm these commands `import findspark import os spark_location='C:\\Spark\\Spark\\spark-2.4.3-bin-hadoop2.7' # Set your own java8_location= 'C:\\Program Files\\Java\\jdk1.8.0_201' # Set your own os.environ['JAVA_HOME'] = java8_location findspark.init(spark_home=spark_location)` I have this issue: `: java.lang.UnsupportedClassVersionError: com/microsoft/sqlserver/jdbc/SQLServerDriver has been compiled by a more recent version of the Java Runtime (class file version 56.0), this version of the Java Runtime only recognizes class file versions up to 52.0 ` – Renos Bardhis Aug 28 '19 at 15:10
  • Hi Reno the problem is that the jdbc driver in incompatible with the Java version in your system. So you should try to sync those two either but installing a different Java version or another jdbc version – abiratsis Aug 28 '19 at 16:15
  • Again I tried different things but the same thing: `pyspark.sql.utils.IllegalArgumentException: 'Unsupported class file major version 56` when I use the command `df.count()`. Really I do not know what else to do. – Renos Bardhis Aug 29 '19 at 14:14
  • No worries everything is solvable. What is the version of the Java and the MS JDBC driver in your system? The Java version you can find with ’java -version’ and the jdbc just check the dll version – abiratsis Aug 29 '19 at 15:50
  • I use now "12.0.2" java version and mssql-jdbc-7.4.1.jre12 and sqljdbc_auth.dll (x64) and still when I type the command `df.count()` I have this problem `pyspark.sql.utils.IllegalArgumentException: 'Unsupported class file major version 56'` – Renos Bardhis Aug 30 '19 at 09:49
  • Also when I use java8 I have the Runtime problem with the versioning and when I switch I have the above problem and I do not understand at all. – Renos Bardhis Aug 30 '19 at 10:28
  • Can you post the whole error message? These two problems dont seem to be related – abiratsis Aug 30 '19 at 10:39
  • The message is huge how can I put it all here? – Renos Bardhis Aug 30 '19 at 10:48
  • You can't maybe you should create a new question? – abiratsis Aug 30 '19 at 10:53
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/198723/discussion-between-alexandros-biratsis-and-renos-bardhis). – abiratsis Aug 30 '19 at 11:01