
Is the Databricks Delta format available with Google's GCP Dataproc?

For AWS and Azure it is clear that this is so. However, after researching the internet I am unsure whether the same holds for GCP, and the Databricks docs are not much clearer.

I am assuming Google feels its own offerings are sufficient, e.g. Google Cloud Storage, but is that mutable? This https://docs.gcp.databricks.com/getting-started/overview.html provides too little context.


1 Answer


The Delta Lake format is supported on Dataproc. You can use it just like any other data format such as Parquet or ORC. The following is an example from this article.

# Copyright 2022 Google LLC.
# SPDX-License-Identifier: Apache-2.0
import sys

from pyspark.sql import SparkSession


def main():
    # The first CLI argument is the GCS path for the Delta table.
    path = sys.argv[1]
    print("Starting job: GCS Bucket: ", path)

    # These two settings enable Delta Lake support in the Spark session.
    spark = (
        SparkSession.builder
        .appName("DeltaTest")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Write 500 rows as a Delta table, then read the table back and show it.
    data = spark.range(0, 500)
    data.write.format("delta").mode("append").save(path)
    df = spark.read.format("delta").load(path)
    df.show()
    spark.stop()


if __name__ == "__main__":
    main()
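
Delta-specific features then work as well. As a quick check, this minimal sketch (assuming the SparkSession configured above; the table path is a placeholder, not a value from the answer) reads an earlier version of the table via Delta's time travel read option:

df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)  # time travel: read the table as of version 0
    .load("gs://my-bucket/delta-table")  # placeholder path
)
df_v0.show()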

You also need to add the Delta Lake dependency when submitting the job, with --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0". Note that the delta-core version must be compatible with the Spark version on your Dataproc image.
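
For context, a complete submit command could look like the sketch below; the cluster name, region, and bucket paths are placeholders rather than values from the answer, and the final argument after -- is passed to the job as the table path:

gcloud dataproc jobs submit pyspark gs://my-bucket/delta_test.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0" \
    -- gs://my-bucket/delta-table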
