
When I run a job on YARN (Hadoop 2.4.0) with map output compression (Snappy) enabled, there is a big impact on job completion time. For example, I ran the following experiments. Job: invertedindex. Cluster: 10 slave VMs (4 CPUs, 8 GB RAM each).

Job completion time of the 5GB invertedindex without compression: 226s; with Snappy compression: 1600s

Job completion time of the 50GB invertedindex without compression: 2000s; with Snappy compression: 14000s

My configuration in mapred-site.xml is like this:

<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
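
(Aside: mapred.map.output.compress.codec is the deprecated Hadoop 1.x name; Hadoop 2.x translates it to mapreduce.map.output.compress.codec automatically, so the mixed naming above should not by itself cause the slowdown. The consistently named property would be:

<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>)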

I have read a lot of material saying that compression should improve performance, but here it has slowed the job down by almost 7 times. What am I doing wrong?

Zeroun

2 Answers


It might be the default setting of mapreduce.output.fileoutputformat.compress.type, which is RECORD.

Basically it compresses every record individually; if your records are small text snippets (e.g. a token in your inverted index), the compressed output can end up larger than the input.

You can try setting this property to BLOCK, which compresses at the block level and gives better compression over redundant text data.
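
For reference, a minimal mapred-site.xml snippet for this (an illustration of the suggestion above, using the Hadoop 2.x property name, not part of the original answer) would be:

<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>

Note that this property applies to compressed SequenceFile job output: RECORD compresses each record value on its own, while BLOCK groups many records together before compressing them.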

Thomas Jungblut
  • Thanks Thomas. But I didn't compress the final output, I just compressed the intermediate data. Do you mean set this property to BLOCK as well even though I don't compress the final output? By the way, do I need to install snappy myself? Is it possible that something is wrong with the hadoop native library? – Zeroun Jul 22 '14 at 13:36
  • I tried the BLOCK configuration; it doesn't help. – Zeroun Jul 22 '14 at 19:29
  • Hi Thomas, this problem was caused by the native library, thank you. – Zeroun Jul 23 '14 at 01:52
  • @Zeroun it depends on what Snappy implementation you use; there are libraries that don't require native libs and are still comparable in speed. – Thomas Jungblut Jul 23 '14 at 07:09

I fixed this compression problem with the following steps:

1. Fix the “Unable to load native-hadoop library” problem: Hadoop "Unable to load native-hadoop library for your platform" warning

2. Install Snappy: http://code.google.com/p/snappy/

3. Copy /usr/local/lib/libsnappy* to $HADOOP_HOME/lib/native/

4. Configure LD_LIBRARY_PATH in hadoop-env.sh and in mapred-site.xml (a consolidated sketch of steps 2–4 follows after the snippet below):

<property>
  <name>mapred.child.env</name>
  <value>LD_LIBRARY_PATH=$HADOOP_HOME/lib/native</value>
</property>
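
Taken together, a shell sketch of steps 2–4 might look like this (paths are illustrative, and the build commands assume the autotools-based snappy distribution from the link above):

# step 2: build and install snappy (installs into /usr/local/lib)
./configure
make
sudo make install

# step 3: copy the snappy libraries into Hadoop's native lib directory
cp /usr/local/lib/libsnappy* $HADOOP_HOME/lib/native/

# step 4 (hadoop-env.sh): make the native libs visible to child JVMs
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH

You can then verify that the native Snappy binding is actually picked up with hadoop checknative -a (available in recent Hadoop 2.x releases), which reports whether the hadoop, zlib, snappy, lz4 and bzip2 native libraries were loaded.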
Zeroun