
I have a server without internet access where I would like to use Delta Lake, so the usual way of enabling Delta Lake in the Spark session does not work:

    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .appName("...") \
        .master("...") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()

Where should I copy the Delta Lake GitHub repository? How can I point the Spark session to the right libraries?

    You can download the jar [delta-core_2.12-0.8.0.jar](https://repo1.maven.org/maven2/io/delta/delta-core_2.12/0.8.0/delta-core_2.12-0.8.0.jar) and use it with the option `--jars` in spark-submit or spark-shell/pyspark. – blackbishop Mar 12 '21 at 13:15
  • Thanks, you put me on the right path. – Stephen Mar 12 '21 at 20:34

1 Answer


Thanks to @blackbishop, I found the answer in how-to-add-third-party-java-jar-files-for-use-in-pyspark.

For Delta Lake, download the jar file delta-core_2.12-0.8.0.jar and copy it to the offline server.

You can add the path to the jar file using the Spark configuration at runtime.

Here is an example:

    from pyspark import SparkConf, SparkContext

    # register the local jar with the Spark context
    conf = SparkConf().set("spark.jars", "/path-to-jar/spark-streaming-kafka-0-8-assembly_2.11-2.2.1.jar")
    sc = SparkContext(conf=conf)

Refer to the Spark configuration documentation for more information.
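Applied to Delta Lake on the offline server, a minimal sketch could look like the following; the jar location and app name are placeholders, and the same jar can alternatively be passed with the `--jars` option of spark-submit or pyspark, as blackbishop noted in the comments:

    from pyspark.sql import SparkSession

    spark = (SparkSession
        .builder
        .appName("delta-offline")
        # local copy of the Delta jar on the server (placeholder path)
        .config("spark.jars", "/path-to-jar/delta-core_2.12-0.8.0.jar")
        # same Delta settings as in the question
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate())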

For a Jupyter notebook:

    from pyspark.sql import SparkSession

    spark = (SparkSession
        .builder
        .appName("Spark_Test")
        .master('yarn-client')
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse")
        .config("spark.executor.cores", "4")
        .config("spark.executor.instances", "2")
        .config("spark.sql.shuffle.partitions", "8")
        .enableHiveSupport()
        .getOrCreate())

    # Do this: distribute the jar and add it to the Python path at runtime
    spark.sparkContext.addPyFile("/path/to/jar/xxxx.jar")

Link to the source where I found it: https://github.com/graphframes/graphframes/issues/104
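A quick way to sanity-check that the jar is actually on the classpath is to write a small Delta table and read it back; `/tmp/delta-test` is only a placeholder path:

    # write a tiny DataFrame in Delta format and read it back (placeholder path)
    spark.range(5).write.format("delta").mode("overwrite").save("/tmp/delta-test")
    spark.read.format("delta").load("/tmp/delta-test").show()

If the jar was not picked up, Spark will fail to find the `delta` data source.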
