Is there any simple way to find out the codec used to compress a file in Hadoop? Do I need to write a Java program, or add the file to Hive so I can use describe formatted table?
One way to do it is to download the file locally (using the hdfs dfs -get command) and then follow the usual procedure for detecting the compression format of local files.
This works well for files compressed outside of Hadoop. For files generated within Hadoop, it only covers a limited number of cases, e.g. text files compressed with gzip.
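If you just want a quick check, a couple of shell commands are usually enough. A rough sketch, assuming a made-up HDFS path; the magic bytes listed cover the formats you're most likely to run into:

# pull the file out of HDFS and let `file` identify whole-file compression
hdfs dfs -get /user/me/data/part-00000 /tmp/part-00000
file /tmp/part-00000        # reports "gzip compressed data", "bzip2 compressed data", etc.

# or just peek at the first bytes without downloading the whole file
hdfs dfs -cat /user/me/data/part-00000 | head -c 16 | xxd
# 1f 8b       -> gzip
# 42 5a 68    -> bzip2 ("BZh")
# 4f 62 6a 01 -> Avro container ("Obj")
# 53 45 51    -> SequenceFile ("SEQ")
# 50 41 52 31 -> Parquet ("PAR1")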
Files compressed within Hadoop are likely to be so-called "container formats", e.g. Avro, SequenceFile, Parquet, etc. That means the file is not compressed as a whole; only chunks of data inside it are. Hive's describe formatted table command that you mention can indeed help you figure out the input format of the underlying files.
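For example, if the files are already registered as a Hive table, the check is a one-liner (my_table is just a placeholder name):

hive -e "DESCRIBE FORMATTED my_table;"
# the InputFormat, OutputFormat and SerDe Library lines reveal the container format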
Once you know the file format, refer to its documentation or source code for the details of codec detection. Some file formats even come with command-line tools to inspect a file's metadata, which reveals the compression codec. Some examples:
Avro:
hadoop jar /path/to/avro-tools.jar getmeta FILE_LOCATION_ON_HDFS --key 'avro.codec'
Parquet:
hadoop jar /path/to/parquet-tools.jar meta FILE_LOCATION_ON_HDFS
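SequenceFiles don't have a comparable metadata tool as far as I know, but the codec class name is stored as plain text in the file header, so a rough check like this (the path is again a placeholder) usually does the trick:

hdfs dfs -cat /user/me/data/part-00000 | head -c 300 | strings | grep -i codec
# prints e.g. org.apache.hadoop.io.compress.SnappyCodec when record/block compression is enabled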
If you are asking which codec MapReduce uses for intermediate map output and/or final output, you can check Hadoop's configuration file, typically located at <HADOOP_HOME>/etc/hadoop/mapred-site.xml. I am not, however, aware of a way to check this directly from the command line.
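Since these settings are plain XML, grepping the file is usually the quickest way to see them; the path below assumes the standard Hadoop 2.x layout:

grep -A 1 compress "$HADOOP_HOME"/etc/hadoop/mapred-site.xml
# shows each compression-related property name followed by its value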
Settings for intermediate map output compression look something like this:
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
Settings for job output compression look something like this:
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.type</name>
  <value>BLOCK</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.GzipCodec</value>
</property>
From those two snippets, you can see that I'm using the Gzip codec and that I'm compressing both the intermediate map output and the final job output. Hope that helps!