I'm fairly new to Delta Lake and the lakehouse architecture on Databricks. I have some questions based on the following sequence of actions (sketched in code below):
- I import some Parquet files.
- Convert them to Delta (creating one snappy.parquet file).
- Delete one random row (creating one new snappy.parquet file).
- I check the contents of both snappy files (version 0 and version 1 of the Delta table), and both contain all of the data, each reflecting the state of its version.
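For reference, here is roughly what I'm doing (a minimal PySpark sketch; the paths and the `id = 42` delete predicate are just placeholders for my actual data):

```python
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()  # on Databricks, `spark` already exists

# 1. Import the Parquet files and write them out as a Delta table (version 0).
df = spark.read.parquet("/tmp/source_data")
df.write.format("delta").save("/tmp/delta_table")

# 2. Delete one row; this produces a new data file and a new table version (version 1).
delta_table = DeltaTable.forPath(spark, "/tmp/delta_table")
delta_table.delete("id = 42")  # example predicate, stands in for the random row

# 3. Compare the two versions via time travel.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta_table")
v1 = spark.read.format("delta").option("versionAsOf", 1).load("/tmp/delta_table")
print(v0.count(), v1.count())  # version 1 should have one row fewer

# The transaction log shows which operation created each version.
delta_table.history().show()
```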
Does this mean Delta simply duplicates the data for every new version?
How is that scalable, or am I missing something?