My ORC dataset with Snappy compression was 3.3 GB when it was originally constructed via a series of small writes to 128 KB files. It totals 400 million rows across 9 columns: one timestamp column and eight string columns no longer than 200 characters.
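For concreteness, the schema looks roughly like this (the column names below are placeholders, not the real ones):

    from pyspark.sql.types import StructType, StructField, TimestampType, StringType

    # Placeholder schema matching the shape described above: one timestamp
    # column plus eight string columns (values no longer than 200 chars).
    schema = StructType(
        [StructField("event_ts", TimestampType())]
        + [StructField(f"str_col_{i}", StringType()) for i in range(1, 9)]
    )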
When I read the whole folder using spark.read.orc("myfolder/*") and simply write it out to another folder with no changes, the dataset balloons to 4 times its original size using the same defaults.
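Here is a minimal sketch of that read-and-rewrite step; the session setup and the output folder name are just illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-rewrite").getOrCreate()

    # Read every ORC file under the original folder.
    dataframe_out = spark.read.orc("myfolder/*")

    # Illustrative output path; Spark's ORC writer defaults to Snappy compression,
    # yet the rewritten copy comes out roughly 4x larger than the 3.3 GB input.
    dirname_out = "myfolder_rewritten"
    dataframe_out.write.orc(dirname_out)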
This is a known problem:
Why does the repartition() method increase file size on disk?
Spark repartition dataframe cause datasize increase 10 times
I've tried the following, to no avail:
dataframe_out.write.orc(dirname_out) # default write options, 4x increase
dataframe_out.write.option("maxRecordsPerFile", 50000).orc(dirname_out) # 4x increase
dataframe_out.write.orc(dirname_out, compression="zlib") # results in 3x instead of 4x
dataframe_out.write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.coalesce(10000).write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.repartition(10000).write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.repartition(10000, "name_column").write.mode("overwrite").orc(dirname_out) # 4x increase
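For reference, by "size" I just mean the total on-disk size of the files under each folder, measured roughly like this (assuming the data sits on local disk; the output folder name is illustrative):

    from pathlib import Path

    def folder_size_gb(path: str) -> float:
        """Total size of all files under `path`, in GB."""
        return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e9

    print(folder_size_gb("myfolder"))            # ~3.3 GB for the original data
    print(folder_size_gb("myfolder_rewritten"))  # roughly 4x that after the rewrite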
Can someone give a brief overview of how best to optimize compression when writing to an ORC Snappy file? This is not a question of which compression codec is best; I would just like to get to the bottom of why the same compression format yields such inconsistent sizes. I'd like to get as close to the original dataset size as possible.