I have a dataset of close to 2 billion rows in Parquet format, spread across 200 files and occupying 17.4 GB on S3. Close to 45% of its rows are duplicates. I deduplicated the dataset using Spark's `distinct` function and wrote the result to a different location on S3.
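For reference, this is roughly the operation I am running (a minimal PySpark sketch; the bucket and path names are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup").getOrCreate()

# Read the original ~2 billion row Parquet dataset (200 files, 17.4 GB)
df = spark.read.parquet("s3://my-bucket/raw/")  # placeholder path

# Drop duplicate rows and write the result to a new location
df.distinct().write.parquet("s3://my-bucket/deduplicated/")  # placeholder path
```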
I expected the storage size to drop by roughly half. Instead, the deduplicated data takes 34.4 GB, double the size of the dataset that still contains the duplicates.
I checked the metadata of the two datasets and found that the column encodings of the original and the deduplicated data differ.
Difference in column encodings
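This is roughly how I compared the metadata (a sketch using pyarrow on one downloaded file from each dataset; the file names are placeholders):

```python
import pyarrow.parquet as pq

# Print the per-column encodings and compressed sizes of one file from each dataset
for name in ["original_part.parquet", "dedup_part.parquet"]:  # placeholder file names
    rg = pq.ParquetFile(name).metadata.row_group(0)
    print(name)
    for i in range(rg.num_columns):
        col = rg.column(i)
        print(f"  {col.path_in_schema}: encodings={col.encodings}, "
              f"compressed={col.total_compressed_size}")
```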
I want to understand how to achieve the expected reduction in storage size.
Beyond that, I have a few further questions:
- Does this anomaly affect performance in any way? In my process, I have to apply a lot of filters on these columns and use `distinct` while persisting the filtered data (a sketch of this workflow is below the list).
- I have seen in a few Parquet blogs online that a column has only one encoding, yet here I see more than one encoding per column. Is this normal?