
Coming from questions like this and this one, I asked myself whether spark.rdd.compress also has an effect when I save a DataFrame, which is partitioned on the RDD level, to a (for example) Parquet table.
Or, in other words: does spark.rdd.compress also compress the table I create when I use dataframe.write.saveAsTable(...)?

Taken from the docs, spark.rdd.compress does the following:

Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER in Java and Scala or StorageLevel.MEMORY_ONLY in Python). Can save substantial space at the cost of some extra CPU time. Compression will use spark.io.compression.codec.

So, additionally, if such compression works, will it also cost additional CPU to retrieve data again from such a table?
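
To make the scenario concrete, here is a minimal sketch of what I mean (the input path and table name are just placeholders, not my real job):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Hypothetical setup: spark.rdd.compress is enabled, the DataFrame is
// persisted in serialized form, and then written out as a table.
val spark = SparkSession.builder()
  .appName("rdd-compress-question")
  .config("spark.rdd.compress", "true")
  .getOrCreate()

val df = spark.read.parquet("/path/to/input")  // placeholder input path
  .repartition(200)                            // partitioned on RDD level
  .persist(StorageLevel.MEMORY_ONLY_SER)       // serialized caching, where spark.rdd.compress applies

df.write.saveAsTable("my_table")               // does spark.rdd.compress affect this output?
```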

Markus

1 Answer


Does spark.rdd.compress also compress the table I create when I use dataframe.write.saveAsTable(...)?

It won't, and neither will it for RDD sinks.

As stated in the documentation you quote, it is applicable only for serialized (_SER) caching. It has nothing to do with external storage.
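
If you want the files behind the table to be compressed, that is configured on the data source side, not via spark.rdd.compress. A rough sketch for Parquet, using the standard spark.sql.parquet.compression.codec setting and the writer's compression option (the table name here is just a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("parquet-compression").getOrCreate()

// Session-wide default codec for Parquet sinks (independent of spark.rdd.compress):
spark.conf.set("spark.sql.parquet.compression.codec", "snappy")

val df = spark.range(1000).toDF("id")  // toy data, just for illustration

// Or set the codec per write via the writer option:
df.write
  .format("parquet")
  .option("compression", "gzip")
  .saveAsTable("my_compressed_table")  // placeholder table name
```

Reading such a table does incur some CPU for decompression, but that is a property of the Parquet codec you choose, not of spark.rdd.compress.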