Recently we discovered that, due to an issue in our ETL, our Parquet files contained duplicate rows.
We launched a project to remove the duplicates: read each Parquet file, drop the duplicate rows, and write it back. Surprisingly, the deduplicated files actually grew in size!
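For context, the job is essentially a read/drop-duplicates/write round trip. A minimal sketch of what that looks like with pandas, assuming a single-file round trip (the paths below are placeholders, not our real layout):

```python
import pandas as pd

# Read the existing Parquet file (placeholder path).
df = pd.read_parquet("events_with_dupes.parquet")

# Drop exact duplicate rows across all columns.
df = df.drop_duplicates()

# Write the deduplicated data back out (placeholder path).
df.to_parquet("events_deduped.parquet", index=False)
```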
How can this be explained? Is it possible that, with less data, certain compression or encoding schemes simply don't kick in?
Alternatively, should we be looking for a bug in the deduplication logic (however unlikely)?
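In case it helps anyone reason about this concretely: the Parquet footer records per-column compressed/uncompressed sizes, encodings, and codec, so the before/after files can be compared to see where the growth comes from. A rough pyarrow sketch (file paths are hypothetical):

```python
import pyarrow.parquet as pq

def summarize(path: str) -> None:
    """Print per-column sizes, encodings, and codec for each row group."""
    meta = pq.ParquetFile(path).metadata
    print(f"{path}: {meta.num_rows} rows in {meta.num_row_groups} row group(s)")
    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            c = meta.row_group(rg).column(col)
            print(f"  rg{rg} {c.path_in_schema}: "
                  f"{c.total_compressed_size} B compressed / "
                  f"{c.total_uncompressed_size} B raw, "
                  f"encodings={c.encodings}, codec={c.compression}")

# Hypothetical before/after files.
summarize("events_with_dupes.parquet")
summarize("events_deduped.parquet")
```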