
So, Spark has the file spark-defaults.conf for specifying settings, including which compression codec is to be used and at what stage (RDD, shuffle). Most of these settings can also be set at the application level.

EDITED:

from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
# This property expects a fully qualified codec class name, not "snappy".
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")

How can I use spark-defaults.conf to tell Spark to use a particular codec to compress Spark outputs only?

Option 1:

spark.hadoop.mapred.output.compress true
spark.hadoop.mapred.output.compression.codec snappy

Option 2:

spark.mapreduce.output.fileoutputformat.compress true
spark.mapreduce.output.fileoutputformat.compress.codec snappy

Option 3:

mapreduce.output.fileoutputformat.compress true
mapreduce.output.fileoutputformat.compress.codec snappy

Does anyone know the proper way to set this (using one of these options or something similar)? I am running Spark 1.6.1.

nikk

2 Answers


You should add this to your spark-defaults.conf. Note that spark-defaults.conf is a plain-text file of whitespace-separated key/value pairs, not XML, and that the codec property expects a fully qualified class name:

spark.hadoop.mapred.output.compress true
spark.hadoop.mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec

This is the same as adding these to the spark-submit command:

--conf spark.hadoop.mapred.output.compress=true
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
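To see these settings in action, here is a minimal PySpark sketch (the app name and output path are hypothetical, and it assumes Snappy support is present in your Hadoop native libraries):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("compressed-output-demo")  # hypothetical app name
# Mirror the spark-defaults.conf entries above.
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.SnappyCodec")

sc = SparkContext(conf=conf)
rdd = sc.parallelize(["a", "b", "c"])
# saveAsTextFile writes through the old mapred output format, so the
# part files should come out Snappy-compressed.
rdd.saveAsTextFile("/tmp/compressed-output")  # hypothetical path
sc.stop()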
ronhash

Spark compression is explained at the following link: http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization

According to this, you can configure lz4, lzf, or snappy compression as

spark.io.compression.codec     lz4

Or

spark.io.compression.codec     org.apache.spark.io.LZ4CompressionCodec

in the conf/spark-defaults.conf configuration file. This file specifies the default configuration for your jobs and their executors, which run on the worker nodes.
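As a sketch, the same setting can also be supplied programmatically through SparkConf (the app name is arbitrary). Keep in mind that, per the linked docs, spark.io.compression.codec governs Spark-internal data such as shuffle spills and serialized RDD blocks, not the output files your job writes:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("io-codec-demo")  # hypothetical app name
# Compresses internal data (shuffle, RDD blocks, broadcasts) only;
# it does not affect the files written by saveAsTextFile and friends.
conf.set("spark.io.compression.codec", "lz4")
sc = SparkContext(conf=conf)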

Malemi
  • Not quite what I was asking about. Edited question to provide clarity. – nikk Aug 15 '16 at 00:57
  • Every configuration that can be set in SparkConf can also be set in conf/spark-defaults.conf. – Malemi Aug 16 '16 at 11:04
  • But what if you need to compress only the output RDD, and not the input or the intermediate RDDs from the shuffle phase? – nikk Dec 12 '16 at 04:54
  • You can call the saveAsTextFile function on the RDD and pass your compression codec as the second argument (see the sketch below this thread). – Malemi Dec 24 '16 at 15:35
  • [looking here](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.io.CompressionCodec): "The wire protocol for a codec is not guaranteed compatible across versions of Spark. This is intended for use as an internal compression utility within a single Spark application." The way I read this... serializing to disk with one of these codecs (snappy, lz4, lzf, zstd) is a bad idea, right? but I'm guessing using `org.apache.hadoop.io.compress.GzipCodec` would be ok? – kmh Sep 14 '18 at 19:50
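Following up on Malemi's per-save suggestion, here is a minimal PySpark sketch (the output path is hypothetical); PySpark's saveAsTextFile takes the codec class name as its second argument:

from pyspark import SparkContext

sc = SparkContext(appName="per-save-codec-demo")  # hypothetical app name
rdd = sc.parallelize(["a", "b", "c"])
# Only this save is compressed; inputs and shuffle data are unaffected,
# which compresses the output RDD alone.
rdd.saveAsTextFile(
    "/tmp/output-gzip",  # hypothetical path
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
sc.stop()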