My ORC dataset with Snappy compression was 3.3 GB when it was originally constructed via a series of small writes to 128 KB files. It totals 400 million rows across 9 columns: one timestamp column and eight string columns no longer than 200 characters.
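For concreteness, the schema looks roughly like this (the column names below are placeholders, not the real ones):

    from pyspark.sql.types import StructType, StructField, TimestampType, StringType

    # Placeholder schema matching the shape described above: one timestamp
    # column plus eight string columns (values no longer than 200 chars).
    schema = StructType(
        [StructField("event_ts", TimestampType())]
        + [StructField(f"str_col_{i}", StringType()) for i in range(1, 9)]
    )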
When I read the whole folder using spark.read.orc("myfolder/*") and simply write it out to another folder with no changes, the dataset balloons to 4 times its original size using the same defaults.
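Here is a minimal sketch of that read-and-rewrite step; the session setup and the output folder name are just illustrative:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("orc-rewrite").getOrCreate()

    # Read every ORC file under the original folder.
    dataframe_out = spark.read.orc("myfolder/*")

    # Illustrative output path; Spark's ORC writer defaults to Snappy compression,
    # yet the rewritten copy comes out roughly 4x larger than the 3.3 GB input.
    dirname_out = "myfolder_rewritten"
    dataframe_out.write.orc(dirname_out)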
This is a known problem:
Why does the repartition() method increase file size on disk?
Spark repartition dataframe cause datasize increase 10 times
I've tried the following, to no avail:
dataframe_out.write.orc(dirname_out) # default write options, 4x increase
dataframe_out.write.option("maxRecordsPerFile", 50000).orc(dirname_out) # 4x increase
dataframe_out.write.orc(dirname_out, compression="zlib") # results in 3x instead of 4x
dataframe_out.write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.coalesce(10000).write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.repartition(10000).write.mode("overwrite").orc(dirname_out) # 4x increase
dataframe_out.repartition(10000, "name_column").write.mode("overwrite").orc(dirname_out) # 4x increase
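For reference, by "size" I just mean the total on-disk size of the files under each folder, measured roughly like this (assuming the data sits on local disk; the output folder name is illustrative):

    from pathlib import Path

    def folder_size_gb(path: str) -> float:
        """Total size of all files under `path`, in GB."""
        return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file()) / 1e9

    print(folder_size_gb("myfolder"))            # ~3.3 GB for the original data
    print(folder_size_gb("myfolder_rewritten"))  # roughly 4x that after the rewrite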
Can someone give a brief overview of how best to optimize compression when writing to an ORC Snappy file? This is not a question of which compression codec is best; I would just like to get to the bottom of why the same compression format yields such inconsistent sizes. I'd like to get as close to the original dataset size as possible.