
I just use Spark to read a Parquet file, do a repartition(1) shuffle, and then save it back to Parquet. The weird thing is that the new file is much larger than the original one; even the metadata file is hundreds of KB larger. Has anyone noticed this issue? Is there any way to make the Parquet output as small as possible under a given compression codec (e.g. gzip)?
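For reference, roughly what I do, as a minimal sketch (the paths and the gzip codec here are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repartition-write").getOrCreate()

// Read the original Parquet data, collapse it into a single partition,
// and write it back with the gzip codec mentioned above.
// "in.parquet" and "out.parquet" are placeholder paths.
val df = spark.read.parquet("in.parquet")

df.repartition(1)
  .write
  .option("compression", "gzip")
  .parquet("out.parquet")
```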

Edit: I read the linked post and got the basic idea behind this issue. I would still like to discuss which kind of column we should choose for the sorting, and whether there is a general strategy for optimizing it.

G_cy
    Possible duplicate of [Why are Spark Parquet files for an aggregate larger than the original?](http://stackoverflow.com/questions/38153935/why-are-spark-parquet-files-for-an-aggregate-larger-than-the-original) – eliasah Feb 09 '17 at 23:03

1 Answer


I agree with the idea from the post linked in the comments on my question. In my situation, sorting before writing turned out to be a good choice. Specifically, I tested different columns, both single and composite. In general, sorting on the columns that carry the most information in your file is an effective strategy, as in the sketch below. Any comments are welcome.
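A minimal sketch of that approach, assuming a placeholder column name `high_info_col` and placeholder paths:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sorted-write").getOrCreate()

val df = spark.read.parquet("in.parquet")

// Sorting within the output partition groups similar values together,
// which helps Parquet's dictionary/run-length encoding and the gzip codec.
// "high_info_col" stands in for the column(s) carrying the most information.
df.repartition(1)
  .sortWithinPartitions("high_info_col")
  .write
  .option("compression", "gzip")
  .parquet("out_sorted.parquet")
```

For a composite sort, pass additional column names to sortWithinPartitions; in my tests, the ordering of the sort columns also affected the resulting file size.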

G_cy