Questions tagged [delta-lake]

Delta Lake is an open source storage layer that brings ACID transactions to Apache Spark. It provides scalable metadata handling, time travel, a unified batch and streaming source and sink, and full compatibility with the Apache Spark APIs.

From https://delta.io/:

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features:

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest level of isolation.
  • Scalable Metadata Handling: In big data, even the metadata itself can be "big data". Delta Lake treats metadata just like data, leveraging Spark's distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • Audit History: The Delta Lake transaction log records details about every change made to the data, providing a full audit trail of the changes.
  • Updates and Deletes: Delta Lake supports Scala / Java APIs to merge, update and delete datasets. This allows you to easily comply with GDPR and CCPA and also simplifies use cases like change data capture.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
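To make the feature list concrete, here is a minimal PySpark sketch. It assumes a SparkSession already configured for Delta Lake (see the quickstart on delta.io, and the installation sketch further down this page); the path is illustrative.

```python
path = "/tmp/delta/events"  # illustrative location

# Writes are ACID transactions.
spark.range(0, 5).write.format("delta").mode("overwrite").save(path)

# Ordinary batch read through the regular Spark API.
spark.read.format("delta").load(path).show()

# Time travel: read the table as it was at an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# The same table also serves as a streaming source.
stream = spark.readStream.format("delta").load(path)
```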
1226 questions
25 votes · 3 answers

Parquet vs Delta format in Azure Data Lake Gen 2 store

I am importing fact and dimension tables from SQL Server to Azure Data Lake Gen 2. Should I save the data as "Parquet" or "Delta" if I am going to wrangle the tables to create a dataset useful for running ML models on Azure Databricks? What is the…
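For context on the trade-off: a Delta table stores the same Parquet files underneath, plus a _delta_log/ transaction log that enables ACID updates, schema enforcement, and time travel. A minimal sketch, assuming an existing DataFrame df and illustrative mount paths:

```python
# Same data, two formats; only the format string differs.
df.write.format("parquet").mode("overwrite").save("/mnt/lake/fact_sales_parquet")

# The Delta write adds a _delta_log/ directory alongside the Parquet
# files, which is what enables updates, MERGE, and versioning later.
df.write.format("delta").mode("overwrite").save("/mnt/lake/fact_sales_delta")
```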
23 votes · 8 answers

Databricks drop a delta table?

How can I drop a Delta Table in Databricks? I can't find any information in the docs... maybe the only solution is to delete the files inside the folder 'delta' with the magic command or dbutils: %fs rm -r delta/mytable? EDIT: For clarification, I…
Joanteixi · 427 · 1 · 4 · 10
23 votes · 2 answers

Apache Spark + Delta Lake concepts

I have many doubts related to Spark + Delta. 1) Databricks proposes 3 layers (bronze, silver, gold), but which layer is recommended for Machine Learning and why? I suppose they propose to have the data clean and ready in the gold…
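For orientation, a hedged sketch of the bronze/silver/gold ("medallion") flow the question refers to; paths, column names, and the feature logic are all illustrative. ML workloads typically read from silver or gold, once the data has been cleaned:

```python
# Bronze: raw data landed as-is.
raw = spark.read.json("/mnt/raw/events")
raw.write.format("delta").mode("append").save("/mnt/bronze/events")

# Silver: de-duplicated, validated, conformed.
bronze = spark.read.format("delta").load("/mnt/bronze/events")
silver = bronze.dropDuplicates(["event_id"]).filter("event_ts IS NOT NULL")
silver.write.format("delta").mode("overwrite").save("/mnt/silver/events")

# Gold: aggregated, analysis/ML-ready tables.
gold = silver.groupBy("user_id").count().withColumnRenamed("count", "n_events")
gold.write.format("delta").mode("overwrite").save("/mnt/gold/user_features")
```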
20 votes · 7 answers

How to drop a column from a Databricks Delta table?

I have recently started discovering Databricks and faced a situation where I need to drop a certain column of a delta table. When I worked with PostgreSQL it was as easy as ALTER TABLE main.metrics_table DROP COLUMN metric_1; I was looking…
samba · 2,821 · 6 · 30 · 85
17 votes · 3 answers

Databricks - How to change a partition of an existing Delta table?

I have a table in Databricks delta which is partitioned by transaction_date. I want to change the partition column to view_date. I tried to drop the table and then create it with a new partition column using PARTITIONED BY (view_date). However my…
samba · 2,821 · 6 · 30 · 85
13 votes · 3 answers

Databricks - is not empty but it's not a Delta table

I run a query on Databricks: DROP TABLE IF EXISTS dublicates_hotels; CREATE TABLE IF NOT EXISTS dublicates_hotels ... I'm trying to understand why I receive the following error: Error in SQL statement: AnalysisException: Cannot create table…
QbS · 425 · 1 · 4 · 17
13 votes · 1 answer

lakeFS, Hudi, Delta Lake merge and merge conflicts

I'm reading documentation about lakeFS and right now don't clearly understand what a merge or even a merge conflict is in terms of lakeFS. Let's say I use Apache Hudi for ACID support over a single table. I'd like to introduce multi-table ACID support…
alexanoid · 24,051 · 54 · 210 · 410
13 votes · 3 answers

How to rename a column in Databricks

How do you rename a column in Databricks? The following does not work: ALTER TABLE mySchema.myTable change COLUMN old_name new_name int It returns the error: ALTER TABLE CHANGE COLUMN is not supported for changing column 'old_name' with type…
David Maddox · 1,884 · 3 · 21 · 32
13 votes · 5 answers

Delta Lake rollback

Need an elegant way to rollback Delta Lake to a previous version. My current approach is listed below: import io.delta.tables._ val deltaTable = DeltaTable.forPath(spark, testFolder) spark.read.format("delta") .option("versionAsOf", 0) …
Fang Zhang · 1,597 · 18 · 18
11 votes · 2 answers

What are the major differences between S3 lake formation governed tables and databricks delta tables?

What are the major differences between S3 lake formation governed tables and databricks delta tables? They look pretty similar.
MGomez · 123 · 1 · 5
11 votes · 3 answers

How to CREATE TABLE USING delta with Spark 2.4.4?

This is Spark 2.4.4 and Delta Lake 0.5.0. I'm trying to create a table using the delta data source and it seems I'm missing something. Although the CREATE TABLE USING delta command worked fine, neither the table directory is created nor insertInto…
Jacek Laskowski · 72,696 · 27 · 242 · 420
11 votes · 6 answers

What is the correct way to install the delta module in python?

What is the correct way to install the delta module in Python? In the example they import the module with from delta.tables import *, but I did not find the correct way to install the module in my virtual env. Currently I am using this spark param…
ofriman · 198 · 1 · 1 · 9
10 votes · 2 answers

How to write to delta table/delta format in Python without using Pyspark?

I am looking for a way to write back to a delta table in python without using pyspark. I know there is a library called deltalake/delta-lake-reader that can be used to read delta tables and convert them to pandas dataframes. The goal is to write…
FRITTENPIET · 101 · 1 · 4
10 votes · 2 answers

check if delta table exists on a path or not in databricks

I need to delete certain data from a delta-lake table before I load it. I am able to delete the data from delta table if it exists but it fails when the table does not exist. Databricks scala code below // create delete statement val del_ID =…
VNK · 125 · 1 · 1 · 8
10 votes · 4 answers

How to add a new column to a Delta Lake table?

I'm trying to add a new column to data stored as a Delta Table in Azure Blob Storage. Most of the actions being done on the data are upserts, with many updates and few new inserts. My code to write data currently looks like…
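Two hedged options, assuming a DataFrame df that already carries the new column (table path and column name are illustrative): let the write merge the schema, or evolve the schema explicitly with DDL first.

```python
# Option 1: mergeSchema lets this write add columns missing from the
# table; existing rows read back as NULL for the new column.
(df.write.format("delta")
   .mode("append")
   .option("mergeSchema", "true")
   .save("/mnt/delta/my_table"))

# Option 2: add the column up front, then write as usual.
spark.sql("ALTER TABLE delta.`/mnt/delta/my_table` ADD COLUMNS (new_col STRING)")
```

For MERGE-based upserts, newer Delta releases gate automatic schema evolution behind the spark.databricks.delta.schema.autoMerge.enabled setting.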