The Delta Lake format is supported on Dataproc, and you can use it like any other data format such as Parquet or ORC. The following is an example from this article.
# Copyright 2022 Google LLC.
# SPDX-License-Identifier: Apache-2.0
import sys
from pyspark.sql import SparkSession
from delta import *
def main():
    # GCS path to write the Delta table to, passed as the first job argument
    input = sys.argv[1]
    print("Starting job: GCS Bucket: ", input)

    # Enable the Delta Lake SQL extension and catalog on the Spark session
    spark = SparkSession \
        .builder \
        .appName("DeltaTest") \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .getOrCreate()

    # Write 500 rows to a Delta table in GCS, then read them back and show them
    data = spark.range(0, 500)
    data.write.format("delta").mode("append").save(input)

    df = spark.read \
        .format("delta") \
        .load(input)
    df.show()

    spark.stop()


if __name__ == "__main__":
    main()
You also need to add the Delta Lake dependency when submitting the job, with --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0".
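For example, a job submission might look like the following sketch. The script name, cluster name, region, and bucket path are placeholders you would replace with your own values; the argument after -- is passed to the script as sys.argv[1].

# Submit the PySpark job to an existing Dataproc cluster, pulling in the
# Delta Lake package and passing the GCS path for the Delta table.
gcloud dataproc jobs submit pyspark delta_test.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --properties="spark.jars.packages=io.delta:delta-core_2.12:1.1.0" \
    -- gs://my-bucket/delta-test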