
Recently we discovered that, due to an issue in our ETL pipeline, our Parquet files contained duplicate rows.

We launched a project to remove the duplicates (read the Parquet files, deduplicate, and write them back). Surprisingly, the deduplicated files actually grew in size!
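Roughly, each file goes through a pass like the following (a minimal sketch; the file names are placeholders and pandas stands in for the actual job, which has the same read → deduplicate → write shape):

```python
import pandas as pd

# Read one Parquet file, drop exact duplicate rows, write it back out.
df = pd.read_parquet("part-0001.parquet")          # placeholder path
deduped = df.drop_duplicates()                     # exact-row dedup
deduped.to_parquet("part-0001-dedup.parquet", index=False)
```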

How can this be explained? Is it possible that, with a smaller amount of data, certain compression techniques simply don't kick in?

Alternatively, should we be looking for a bug in the deduplication logic (however unlikely that seems)?

Vitaliy

2 Answers


You can't actually delete a record from a Parquet file in place: the files are immutable. A 'deleted' record is still physically there; what gets added is extra information recording which records were 'deleted'.
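So if the goal is to physically drop rows, the file has to be rewritten from scratch. A minimal sketch with pyarrow (the `id` column and the filter value are hypothetical):

```python
import pyarrow.parquet as pq
import pyarrow.compute as pc

# Parquet files are immutable, so "deleting" rows means
# writing a brand-new file that simply omits them.
table = pq.read_table("data.parquet")
keep = pc.not_equal(table["id"], 42)  # hypothetical column and value
pq.write_table(table.filter(keep), "data-rewritten.parquet")
```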

Menzies

It may be related to a change in the Parquet file structure. Each row group has its own metadata, so if deduplication changes the number of row groups, the metadata overhead changes too, and smaller row groups also give the compression codec less data to work with. That could explain why the files grew.
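One way to check is to compare the row-group layout of an old and a new file directly. A minimal sketch with pyarrow (file names are placeholders):

```python
import pyarrow.parquet as pq

def describe(path):
    """Print row-group layout and compressed sizes for a Parquet file."""
    md = pq.ParquetFile(path).metadata
    print(f"{path}: {md.num_rows} rows in {md.num_row_groups} row group(s)")
    for i in range(md.num_row_groups):
        rg = md.row_group(i)
        print(f"  group {i}: {rg.num_rows} rows, {rg.total_byte_size} bytes")
        col = rg.column(0)  # first column, as an example
        print(f"    col 0: {col.compression}, encodings={col.encodings}")

describe("before.parquet")
describe("after_dedup.parquet")
```

If the rewritten files show many more, smaller row groups, the extra per-group metadata plus weaker page-level compression can easily outweigh the removed rows.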

Ori N