
I am trying to connect to a Teradata server through PySpark.

My CLI code is below:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder \
                        .appName("Teradata connect") \
                        .getOrCreate()
    df = spark.read \
              .format("jdbc") \
              .options(url="jdbc:teradata://xy/",
                       driver="com.teradata.jdbc.TeraDriver",
                       dbtable="dbname.tablename",
                       user="user1", password="***") \
              .load()

This gives the error:

py4j.protocol.Py4JJavaError: An error occurred while calling o159.load. : java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver

To resolve this, I think I need to add the jars `terajdbc4.jar` and `tdgssconfig.jar`.

In Scala, to add a jar we can use:

    sc.addJar("<path>/jar-name.jar")

If I use the same in PySpark, I get an error:

AttributeError: 'SparkContext' object has no attribute 'addJar'

or

AttributeError: 'SparkSession' object has no attribute 'addJar'

How can I add the jars `terajdbc4.jar` and `tdgssconfig.jar`?

  • `pyspark2 --jars /data/1/gcgeeapmxtldu/lib/tdgssconfig.jar,/data/1/gcgeeapmxtldu/lib/terajdbc4.jar` and `spark = SparkSession.builder.appName("sparkanalysis").config("spark.driver.extraClassPath","/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar").config("spark.executor.extraClassPath","/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar").config("spark.jars","/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar").config("spark.repl.local.jars","/local_path/tdgssconfig.jar,/local_path/terajdbc4.jar").getOrCreate()` – Soumya May 05 '19 at 01:26
  • `df = spark.read.format("jdbc").option("url","jdbc:teradata://xyz").option("driver","com.teradata.jdbc.TeraDriver").option("dbtable","table").option("user","USR1").option("password","*****").load()` – Soumya May 05 '19 at 01:27

1 Answer


Try following this post, which explains how to add JDBC drivers to PySpark:

How to add jdbc drivers to classpath when using PySpark?

The example there is for Postgres and Docker, but the answer should work for your scenario. Note that you are correct about the driver files: most JDBC drivers ship as a single jar, but Teradata splits theirs into two. I believe one is the actual driver and the other (tdgss) contains the security pieces. Both files must be added to the classpath for the driver to load.
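
For example, a minimal sketch of the session setup (the jar paths here are placeholders; point them at your local copies of both files):

    from pyspark.sql import SparkSession

    # Both Teradata jars must be on the classpath; spark.jars takes a
    # comma-separated list. The paths below are placeholders.
    spark = SparkSession.builder \
        .appName("Teradata connect") \
        .config("spark.jars", "/path/to/terajdbc4.jar,/path/to/tdgssconfig.jar") \
        .getOrCreate()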

Alternatively, simply google "how to add jdbc drivers to pyspark".

  • I used the following to open the CLI: `pyspark2 --driver-class-path /path/terajdbc4.jar:/path/tdgssconfig.jar`, but received the error `py4j.protocol.Py4JJavaError: An error occurred while calling o76.load. : java.lang.ClassNotFoundException: com.teradata.jdbc.TeraDriver` – Soumya May 03 '19 at 02:49
  • I believe you need to use a comma (not a colon) to separate the jar file names for the Spark shells. This is certainly true for the Scala environment (spark-shell). Also consider where you place the files: if you have HDFS, the Spark shell (including pyspark) will very likely try to find the files in HDFS at the path you specify. If you still get the error, try putting the files in HDFS and giving that path to pyspark. – GMc May 03 '19 at 06:17
  • Finally, I got it fixed. The problem was that my jar files were corrupted, and I also had to fix my code (reflowed version below). First, the CLI command: `pyspark2 --jars /local_path/tdgssconfig.jar,/local_path/terajdbc4.jar`. Second, for the SparkSession, add: `.config("spark.driver.extraClassPath","/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar").config("spark.executor.extraClassPath","/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar").config("spark.jars","/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar").config("spark.repl.local.jars","/local_path/tdgssconfig.jar,/local_path/terajdbc4.jar")` – Soumya May 05 '19 at 01:15
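
For readability, here is that final working setup from the comments, reflowed as a sketch (the `/local_path/` entries are placeholders for wherever the jars actually live):

    # Shell launch, per the comment above:
    #   pyspark2 --jars /local_path/tdgssconfig.jar,/local_path/terajdbc4.jar

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("sparkanalysis") \
        .config("spark.driver.extraClassPath", "/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar") \
        .config("spark.executor.extraClassPath", "/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar") \
        .config("spark.jars", "/local_path/terajdbc4.jar,/local_path/tdgssconfig.jar") \
        .config("spark.repl.local.jars", "/local_path/tdgssconfig.jar,/local_path/terajdbc4.jar") \
        .getOrCreate()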