
Is it possible to implement a Delta Lake on-premise? If yes, what software/tools need to be installed?

I'm trying to implement a Delta Lake on-premise to analyze some log files and database tables. My current machine is loaded with Ubuntu and Apache Spark, and I'm not sure what other tools are required.

Are there any other tool suggestions for implementing an on-premise data lake?

Ajoy
    Link to [Delta Lake without Databricks Runtime](https://stackoverflow.com/questions/60817234/delta-lake-without-databricks-runtime) which is a similar question. – Hongbo Miao Apr 26 '23 at 20:03

2 Answers


Yes, you can use Delta Lake on-premise. It's just a matter of using the correct version of the Delta library (0.6.1 for Spark 2.4, 0.8.0 for Spark 3.0), or running the spark-shell/pyspark as follows (for Spark 3.0):

pyspark --packages io.delta:delta-core_2.12:0.8.0
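
If you prefer to configure the session in code instead of passing --packages on the command line, a minimal sketch (assuming the same Spark 3.0 / Delta 0.8.0 combination and a local installation of PySpark) could look like this:

from pyspark.sql import SparkSession

# Start a local Spark session with the Delta package and SQL extensions enabled
spark = (
    SparkSession.builder
    .appName("delta-on-prem")
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.8.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)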

Then you can write data in Delta format, like this:

spark.range(1000).write.format("delta").mode("append").save("1.delta")
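
And you can read it back to verify that the table was written:

spark.read.format("delta").load("1.delta").show()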

It works with local files as well, but if you need to build a real data lake, then you need to use something like HDFS, which is also supported out of the box.
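
With HDFS in place, only the path changes; the namenode host and port below are placeholders for your own cluster:

spark.range(1000).write.format("delta").mode("append").save("hdfs://namenode:8020/delta/events")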

Alex Ott

Yes, you can build a Delta Lake locally. The delta-rs project provides Python APIs for creating Delta tables without a Spark dependency. You can install the Python package with pip:

$ pip install deltalake

Or you can install it with conda (the conda-forge package for delta-rs is deltalake):

$ conda install -c conda-forge deltalake

Then, using a pandas DataFrame as your source, you can create your Delta Table as follows:

import pandas as pd
from deltalake.writer import write_deltalake

df = pd.DataFrame({'x': [1, 2, 3]})
write_deltalake('path/to/table', df)
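
To read the table back without Spark, e.g. into a pandas DataFrame, delta-rs also provides a DeltaTable class:

from deltalake import DeltaTable

# Load the table from the same path and materialize it as pandas
dt = DeltaTable('path/to/table')
print(dt.to_pandas())
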
Jim Hibbard