I have a dataset of close to 2 billion rows in Parquet format, spread across 200 files and occupying 17.4 GB on S3. Close to 45% of its rows are duplicates. I deduplicated the dataset using Spark's `distinct` function and wrote the result to a different location on S3.
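For reference, this is roughly the operation I am running (a minimal PySpark sketch; the bucket and path names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()

# Read the original ~2 billion row Parquet dataset (200 files, 17.4 GB)
df = spark.read.parquet("s3://my-bucket/raw/")  # placeholder path

# Drop duplicate rows and write the result to a new location
df.distinct().write.parquet("s3://my-bucket/deduplicated/")  # placeholder path
```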
I expected the storage size to drop by roughly half. Instead, the deduplicated data takes 34.4 GB, double the size of the dataset that still contains the duplicates.
I checked the metadata of the two datasets and found that the column encodings of the original and the deduplicated data differ.
Difference in column encodings
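This is roughly how I compared the metadata (a sketch using pyarrow on one downloaded file from each dataset; the file names are placeholders):

```python
import pyarrow.parquet as pq

# Print the per-column encodings and compressed sizes of one file from each dataset
for name in ["original_part.parquet", "dedup_part.parquet"]:  # placeholder file names
    rg = pq.ParquetFile(name).metadata.row_group(0)
    print(name)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(f"  {col.path_in_schema}: encodings={col.encodings}, "
              f"compressed={col.total_compressed_size}")
```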
I want to understand how to achieve the expected reduction in storage size.
Beyond that, I have a few further questions:
- Does this anomaly affect performance in any way? In my process, I have to apply a lot of filters on these columns and use `distinct` while persisting the filtered data (a sketch of this workflow is below the list).
- I have seen in a few Parquet blogs online that a column has only one encoding, yet here I see more than one encoding per column. Is this normal?