
Recently we discovered that, due to an issue in our ETL pipeline, our Parquet files contained duplicate rows.

We launched a project to remove the duplicates (read the Parquet files, deduplicate, and write them back). Surprisingly, the deduplicated files actually grew in size!
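Roughly, each file goes through a pass like the following (a minimal sketch; the file names are placeholders and pandas stands in for the actual job, which has the same read → deduplicate → write shape):

```python
import pandas as pd

# Read one Parquet file, drop exact duplicate rows, write it back out.
df = pd.read_parquet("part-0001.parquet")          # placeholder path
deduped = df.drop_duplicates()                     # exact-row dedup
deduped.to_parquet("part-0001-dedup.parquet", index=False)
```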

How can this be explained? Is it possible that, with a smaller amount of data, certain compression techniques simply don't kick in?

Alternatively, should we be looking for a bug in the deduplication logic (however unlikely that seems)?

Vitaliy

2 Answers


You can't actually delete a record from a Parquet file in place: the files are immutable. A 'deleted' record is still physically there; what gets added is extra information recording which records were 'deleted'.
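So if the goal is to physically drop rows, the file has to be rewritten from scratch. A minimal sketch with pyarrow (the `id` column and the filter value are hypothetical):

```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Parquet files are immutable, so "deleting" rows means
# writing a brand-new file that simply omits them.
table = pq.read_table("data.parquet")
keep = pc.not_equal(table["id"], 42)  # hypothetical column and value
pq.write_table(table.filter(keep), "data-rewritten.parquet")
```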

Menzies

It may be related to a change in the Parquet file structure. Each row group has its own metadata, so if deduplication changes the number of row groups, the metadata overhead changes too, and smaller row groups also give the compression codec less data to work with. That could explain why the files grew.
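One way to check is to compare the row-group layout of an old and a new file directly. A minimal sketch with pyarrow (file names are placeholders):

```python
import pyarrow.parquet as pq

def describe(path):
    """Print row-group layout and compressed sizes for a Parquet file."""
    md = pq.ParquetFile(path).metadata
    print(f"{path}: {md.num_rows} rows in {md.num_row_groups} row group(s)")
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(f"  group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")
        col = rg.column(0)  # first column, as an example
        print(f"    col 0: {col.compression}, encodings={col.encodings}")

describe("before.parquet")
describe("after_dedup.parquet")
```

If the rewritten files show many more, smaller row groups, the extra per-group metadata plus weaker page-level compression can easily outweigh the removed rows.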

Ori N