Spark has the file spark-defaults.conf
for specifying settings, including which compression codec should be used and at which stage (RDD, shuffle). Most of these settings can also be set at the application level.
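(For context, entries in spark-defaults.conf are plain whitespace-separated key/value pairs, one per line. For example, the setting below is, as far as I understand, the one that controls the codec Spark uses internally for things like shuffle data and serialized RDD blocks; it is separate from the output compression I am asking about:)
spark.io.compression.codec snappy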
EDITED:
from pyspark import SparkConf

conf = SparkConf()
conf.set("spark.hadoop.mapred.output.compress", "true")
# The codec must be a fully qualified class name, not just "snappy"
conf.set("spark.hadoop.mapred.output.compression.codec", "org.apache.hadoop.io.compress.SnappyCodec")
How can I use spark-defaults.conf
to tell Spark to use a particular codec to compress Spark outputs only?
Option 1:
spark.hadoop.mapred.output.compress true
spark.hadoop.mapred.output.compression.codec snappy
Option 2:
spark.mapreduce.output.fileoutputformat.compress true
spark.mapreduce.output.fileoutputformat.compress.codec snappy
Option 3:
mapreduce.output.fileoutputformat.compress true
mapreduce.output.fileoutputformat.compress.codec snappy
Does anyone know the proper way to set this (from any of these options or something similar)? I am running Spark 1.6.1.
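In case it helps, this is the check I can run to see which of these properties actually made it into the Hadoop configuration at runtime (just a diagnostic sketch; sc._jsc is an internal handle to the JVM SparkContext):

from pyspark import SparkContext

sc = SparkContext()
# spark.hadoop.* entries from spark-defaults.conf should be copied into the Hadoop
# Configuration with the "spark.hadoop." prefix stripped
hadoop_conf = sc._jsc.hadoopConfiguration()
for key in ["mapred.output.compress",
            "mapred.output.compression.codec",
            "mapreduce.output.fileoutputformat.compress",
            "mapreduce.output.fileoutputformat.compress.codec"]:
    print("%s = %s" % (key, hadoop_conf.get(key)))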