I just use Spark to read a Parquet file, do a repartition(1)
shuffle, and then save it back to Parquet. The weird thing is that the new file is much larger than the original one. Even the metadata file is hundreds of KB larger than the original. Has anyone noticed this issue? Is there any way to make Parquet files as small as possible under a single compression strategy (e.g. gzip)?
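Roughly what I'm doing, as a minimal sketch (I'm showing PySpark here, and the paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-parquet").getOrCreate()

# Read the original Parquet file (placeholder path).
df = spark.read.parquet("/data/input.parquet")

# Collapse everything into a single partition, then write it back out.
(df.repartition(1)
   .write
   .option("compression", "gzip")   # also tried the default snappy
   .mode("overwrite")
   .parquet("/data/output.parquet"))  # this output ends up much larger than the input
```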
Edit: I read another post and got the basic idea behind this issue. I'd still like to discuss which columns we should choose to sort on before writing (something like the sketch below). I'm hoping to find a generally optimal strategy for this.
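The kind of thing I have in mind, as a rough sketch; `user_id` and `event_time` are made-up column names standing in for whatever low-cardinality or repetitive columns might help dictionary/run-length encoding the most:

```python
# Sort within the single partition before writing, so similar values end up
# adjacent and (hopefully) compress better.
sorted_df = (df.repartition(1)
               .sortWithinPartitions("user_id", "event_time"))

(sorted_df.write
          .option("compression", "gzip")
          .mode("overwrite")
          .parquet("/data/output_sorted.parquet"))
```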