
I am having issues reading data from an MSSQL server using Cloudera Spark. I am not sure where the problem is or what is causing it.

Here is my build.sbt:

val sparkversion = "1.6.0-cdh5.10.1"
name := "SimpleSpark"
organization := "com.huff.spark"
version := "1.0"
scalaVersion := "2.10.5"
mainClass in Compile := Some("com.huff.spark.example.SimpleSpark")
assemblyJarName in assembly := "mssql.jar"


libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming-kafka" % "1.6.0" % "provided",
    "org.apache.spark" %% "spark-streaming" % "1.6.0" % "provided",
    "org.apache.spark" % "spark-core_2.10" % sparkversion  % "provided", // to test in cluseter
    "org.apache.spark" % "spark-sql_2.10" % sparkversion % "provided" // to test in cluseter
)

resolvers += "Confluent IO" at "http://packages.confluent.io/maven"
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos"

And here is my Scala source:

package com.huff.spark.example

import org.apache.spark.sql._
import java.sql.{Connection, DriverManager}
import java.util.Properties
import org.apache.spark.{SparkContext, SparkConf}

object SimpleSpark {
    def main(args: Array[String]) {
        val sourceProp = new java.util.Properties
        val conf = new SparkConf().setAppName("SimpleSpark").setMaster("yarn-cluster")  //to test in cluster
        val sc = new SparkContext(conf)
        val SqlContext = new SQLContext(sc)
        val driver = "com.microsoft.sqlserver.jdbc.SQLServerDriver"

        val jdbcDF = SqlContext.read.format("jdbc").options(Map(
            "url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl",
            "driver" -> driver,
            "dbtable" -> "StgS")).load()

        jdbcDF.show(5)
    }
}

And this is the error I see:

17/05/24 04:35:20 ERROR ApplicationMaster: User class threw exception: java.lang.NullPointerException
java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:155)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:91)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:222)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:146)
    at com.huff.spark.example.SimpleSpark$.main(SimpleSpark.scala:16)
    at com.huff.spark.example.SimpleSpark.main(SimpleSpark.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:552)
17/05/24 04:35:20 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.lang.NullPointerException)

I know the problem is in line 16, which is:

val jdbcDF = SqlContext.read.format("jdbc").options(Map(
    "url" -> "jdbc:sqlserver://sqltestsrver;databaseName=LEh;user=sparkaetl;password=sparkaetl",
    "driver" -> driver,
    "dbtable" -> "StgS")).load()

But I can't pinpoint what exactly the problem is. Is it something to do with access (which I doubt), a problem with the connection parameters (the error message would say so), or something else I am not aware of? Thanks in advance :-)

Jaanna
  • Looks like duplicate https://stackoverflow.com/questions/39318667/how-to-connect-pyspark-to-postgres-database-using-jdbc – pasha701 May 24 '17 at 09:40
  • Possible duplicate of [How to connect (Py)Spark to Postgres database using JDBC](https://stackoverflow.com/questions/39318667/how-to-connect-pyspark-to-postgres-database-using-jdbc) – Alper t. Turker Jun 02 '18 at 12:44

2 Answers


If you are using Azure SQL Server, copy the JDBC connection string from the Azure portal. I tried it and it worked for me.

Azure Databricks, using Scala:

import com.microsoft.sqlserver.jdbc.SQLServerDriver
import java.sql.DriverManager
import org.apache.spark.sql.SQLContext
import sqlContext.implicits._

// MS SQL JDBC Connection String ... 
val jdbcSqlConn = "jdbc:sqlserver://***.database.windows.net:1433;database=**;user=***;password=****;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30;"

// Loading the ms sql table via spark context into dataframe
val jdbcDF = sqlContext.read.format("jdbc").options(Map(
  "url" -> jdbcSqlConn,
  "driver" -> "com.microsoft.sqlserver.jdbc.SQLServerDriver",
  "dbtable" -> "***")).load()

// Register the temp table so that we can run SQL queries against it
jdbcDF.registerTempTable("yourtablename")
// Selecting only the top 10 rows here, but you can use any SQL statement
val yourdata = sqlContext.sql("SELECT * FROM yourtablename LIMIT 10")
// Display the data
yourdata.show()
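If you are running on your own cluster via spark-submit rather than on Databricks, the driver jar also has to reach the executors. One common approach (a sketch; the artifact coordinates and version below are assumptions, check Maven Central for a version matching your JRE) is to bundle Microsoft's JDBC driver into the assembly jar through build.sbt:

```scala
// build.sbt addition (sketch): bundle the MSSQL JDBC driver into the assembly
// jar so com.microsoft.sqlserver.jdbc.SQLServerDriver can be loaded on the
// executors. Note: no "provided" scope, so sbt-assembly includes it.
libraryDependencies += "com.microsoft.sqlserver" % "mssql-jdbc" % "6.2.2.jre7"
```

Alternatively, the jar can be passed at submit time with spark-submit's `--jars` option instead of bundling it.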
bunbun

The NPE occurs when Spark tries to close the database Connection, which indicates that it could not obtain a proper connection via JdbcUtils.createConnectionFactory. Check your connection URL and the application logs for failures.
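One quick way to narrow this down is to check whether the driver class can be loaded at all before calling `.load()`. This is a diagnostic sketch, not part of the original answer; the class name is Microsoft's standard SQL Server JDBC driver:

```scala
// Diagnostic sketch: returns true if the given JDBC driver class is on the
// classpath. A false result for the SQL Server driver means the sqljdbc jar
// never reached the cluster (e.g. not bundled in the assembly, not on --jars).
def driverPresent(className: String): Boolean =
  try { Class.forName(className); true }
  catch { case _: ClassNotFoundException => false }

if (!driverPresent("com.microsoft.sqlserver.jdbc.SQLServerDriver"))
  println("Driver jar missing: bundle it in the assembly or pass it with --jars")
```

If this prints the warning when run inside your main method on the cluster, the NPE is almost certainly the missing driver jar rather than the connection parameters.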

Till Rohrmann