Coming from questions like this and this one, I asked myself whether `spark.rdd.compress` also has an effect when I save a dataframe that is partitioned at the RDD level to, for example, a Parquet table. Or, in other words: does `spark.rdd.compress` also compress the table I create when I use `dataframe.write.saveAsTable(...)`?
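To make the setup concrete, here is a minimal sketch of what I mean; the table name, data, and partition count are just placeholders, not my real job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder()
  .appName("rdd-compress-question")
  .config("spark.rdd.compress", "true")      // the setting in question
  .getOrCreate()

val df = spark.range(0, 1000000).toDF("id")
  .repartition(8)                            // partitioned on RDD level
  .persist(StorageLevel.MEMORY_ONLY_SER)     // serialized caching, where the docs say the setting applies

df.count()                                   // materialize the cached partitions
df.write.format("parquet").saveAsTable("my_table")  // does spark.rdd.compress affect this table?
```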
Taken from the docs, `spark.rdd.compress` does the following:

> Whether to compress serialized RDD partitions (e.g. for `StorageLevel.MEMORY_ONLY_SER` in Java and Scala or `StorageLevel.MEMORY_ONLY` in Python). Can save substantial space at the cost of some extra CPU time. Compression will use `spark.io.compression.codec`.
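My reading of that passage, shown as a configuration sketch (the codec value is just the default, listed only to make the relationship explicit):

```scala
import org.apache.spark.SparkConf

// If spark.rdd.compress is enabled, the in-memory RDD compression
// uses whatever codec spark.io.compression.codec points to (lz4 by default):
val conf = new SparkConf()
  .set("spark.rdd.compress", "true")
  .set("spark.io.compression.codec", "lz4")
```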
So, additionally: if such compression does take effect, will it also cost extra CPU time to read data back from such a table?