
Is the Databricks Delta format available with Google's GCP Dataproc?

For AWS and Azure it is clear that this is so. However, after researching the internet I am unsure whether the same holds for GCP, and the Databricks docs are not much clearer.

I am assuming Google feels its own offerings are sufficient, e.g. Google Cloud Storage, but is that mutable? This https://docs.gcp.databricks.com/getting-started/overview.html provides too little context.


1 Answer


The Delta Lake format is supported on Dataproc. You can use it just like any other data format such as Parquet or ORC. The following is an example from this article.

# Copyright 2022 Google LLC.
# SPDX-License-Identifier: Apache-2.0
import sys

from pyspark.sql import SparkSession


def main():
    # The first CLI argument is the GCS path for the Delta table.
    path = sys.argv[1]
    print("Starting job: GCS Bucket: ", path)

    # These two settings enable Delta Lake support in the Spark session.
    spark = (
        SparkSession.builder
        .appName("DeltaTest")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Write 500 rows as a Delta table, then read the table back and show it.
    data = spark.range(0, 500)
    data.write.format("delta").mode("append").save(path)
    df = spark.read.format("delta").load(path)
    df.show()
    spark.stop()


if __name__ == "__main__":
    main()
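
Delta-specific features then work as well. As a quick check, this minimal sketch (assuming the SparkSession configured above; the table path is a placeholder, not a value from the answer) reads an earlier version of the table via Delta's time travel read option:

df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)  # time travel: read the table as of version 0
    .load("gs://my-bucket/delta-table")  # placeholder path
)
df_v0.show()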

You also need to add the Delta Lake dependency when submitting the job, with --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0". Note that the delta-core version must be compatible with the Spark version on your Dataproc image.
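
For context, a complete submit command could look like the sketch below; the cluster name, region, and bucket paths are placeholders rather than values from the answer, and the final argument after -- is passed to the job as the table path:

gcloud dataproc jobs submit pyspark gs://my-bucket/delta_test.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0" \
    -- gs://my-bucket/delta-table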
