
So, Spark has the file spark-defaults.conf for specifying settings, including which compression codec is to be used and at what stage (RDD, shuffle). Most of these settings can also be set at the application level.

EDITED:

from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
# This property expects a fully qualified codec class name, not "snappy".
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")

How can I use spark-defaults.conf to tell Spark to use a particular codec to compress Spark outputs only?

Option 1:

spark.hadoop.mapred.output.compress true
spark.hadoop.mapred.output.compression.codec snappy

Option 2:

spark.mapreduce.output.fileoutputformat.compress true
spark.mapreduce.output.fileoutputformat.compress.codec snappy

Option 3:

mapreduce.output.fileoutputformat.compress true
mapreduce.output.fileoutputformat.compress.codec snappy

Does anyone know the proper way to set this (using one of these options or something similar)? I am running Spark 1.6.1.

nikk

2 Answers


You should add this to your spark-defaults.conf. Note that spark-defaults.conf is a plain-text file of whitespace-separated key/value pairs, not XML, and that the codec property expects a fully qualified class name:

spark.hadoop.mapred.output.compress true
spark.hadoop.mapred.output.compression.codec org.apache.hadoop.io.compress.SnappyCodec

This is the same as adding these to the spark-submit command:

--conf spark.hadoop.mapred.output.compress=true
--conf spark.hadoop.mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec
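To see these settings in action, here is a minimal PySpark sketch (the app name and output path are hypothetical, and it assumes Snappy support is present in your Hadoop native libraries):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("compressed-output-demo")  # hypothetical app name
# Mirror the spark-defaults.conf entries above.
conf.set("spark.hadoop.mapred.output.compress", "true")
conf.set("spark.hadoop.mapred.output.compression.codec",
         "org.apache.hadoop.io.compress.SnappyCodec")

sc = SparkContext(conf=conf)
rdd = sc.parallelize(["a", "b", "c"])
# saveAsTextFile writes through the old mapred output format, so the
# part files should come out Snappy-compressed.
rdd.saveAsTextFile("/tmp/compressed-output")  # hypothetical path
sc.stop()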
ronhash

Spark compression is explained at the following link: http://spark.apache.org/docs/latest/configuration.html#compression-and-serialization

According to this, you can configure lz4, lzf, or snappy compression as

spark.io.compression.codec     lz4

Or

spark.io.compression.codec     org.apache.spark.io.LZ4CompressionCodec

in the conf/spark-defaults.conf configuration file. This file specifies the default configuration for your jobs and their executors, which run on the worker nodes.
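As a sketch, the same setting can also be supplied programmatically through SparkConf (the app name is arbitrary). Keep in mind that, per the linked docs, spark.io.compression.codec governs Spark-internal data such as shuffle spills and serialized RDD blocks, not the output files your job writes:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("io-codec-demo")  # hypothetical app name
# Compresses internal data (shuffle, RDD blocks, broadcasts) only;
# it does not affect the files written by saveAsTextFile and friends.
conf.set("spark.io.compression.codec", "lz4")
sc = SparkContext(conf=conf)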

Malemi
  • Not quite what I was asking about. Edited question to provide clarity. – nikk Aug 15 '16 at 00:57
  • Every configuration that can be set in SparkConf can also be set in conf/spark-defaults.conf. – Malemi Aug 16 '16 at 11:04
  • But what if you need to compress only the output RDD, and not the input or the intermediate RDDs from the shuffle phase? – nikk Dec 12 '16 at 04:54
  • You can call the saveAsTextFile function on the RDD and pass your compression codec as the second argument (see the sketch below this thread). – Malemi Dec 24 '16 at 15:35
  • [looking here](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.io.CompressionCodec): "The wire protocol for a codec is not guaranteed compatible across versions of Spark. This is intended for use as an internal compression utility within a single Spark application." The way I read this... serializing to disk with one of these codecs (snappy, lz4, lzf, zstd) is a bad idea, right? but I'm guessing using `org.apache.hadoop.io.compress.GzipCodec` would be ok? – kmh Sep 14 '18 at 19:50
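Following up on Malemi's per-save suggestion, here is a minimal PySpark sketch (the output path is hypothetical); PySpark's saveAsTextFile takes the codec class name as its second argument:

from pyspark import SparkContext

sc = SparkContext(appName="per-save-codec-demo")  # hypothetical app name
rdd = sc.parallelize(["a", "b", "c"])
# Only this save is compressed; inputs and shuffle data are unaffected,
# which compresses the output RDD alone.
rdd.saveAsTextFile(
    "/tmp/output-gzip",  # hypothetical path
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)
sc.stop()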