Hadoop streaming flat-files to gzip

Question

I've been trying to gzip files (pipe-separated CSV) in Hadoop using hadoop-streaming.jar. I found the following thread on Stack Overflow: "Hadoop: compress file in HDFS?", and I tried both solutions (cat and cut as the mapper). Although I end up with a gzipped file in HDFS, every line now has a tab character at the end, which corrupts my last column. Any ideas how to get rid of it?

I've tried the following two commands (in lots of flavours):

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input <filename> \
  -output <output-path> \
  -mapper "cut -f 2"

and

hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-streaming.jar \
  -Dmapred.reduce.tasks=0 \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -input <filename> \
  -output <output-path> \
  -mapper /bin/cat \
  -inputformat org.apache.hadoop.mapred.TextInputFormat \
  -outputformat org.apache.hadoop.mapred.TextOutputFormat

I know that MapReduce outputs a tab-separated key-value pair, but "cut -f 2" (I also tried "cut -f 2 -d,") should return only the value part, not the tab. So why does every line end with a tab?

I hope someone can enlighten me.
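
For reference, the behaviour described above can be reproduced locally without Hadoop. The sketch below is only a simulation, under the assumption (taken from the Hadoop Streaming documentation) that each mapper output line is split at the first tab into key and value, that a line without a tab becomes the key with an empty value, and that the text output then writes key, tab, value. The function name simulate_streaming_output is hypothetical and is not part of Hadoop. Note also that cut prints lines containing no delimiter unchanged unless -s is given, so "cut -f 2" passes a pipe-separated line through untouched.

```shell
# Local simulation only (no Hadoop involved): mimic how a streaming output
# line is split into key and value and then written back as key<TAB>value.
# simulate_streaming_output is a hypothetical helper, not a Hadoop command.
simulate_streaming_output() {
  awk '{
    i = index($0, "\t")
    if (i == 0) {
      key = $0; value = ""          # no tab: the whole line becomes the key
    } else {
      key = substr($0, 1, i - 1)    # text before the first tab
      value = substr($0, i + 1)     # text after the first tab
    }
    printf "%s\t%s\n", key, value   # key, separator, value
  }'
}

# A pipe-separated line without tabs; sed -n l makes the tab visible.
printf 'a|b|c\n' | simulate_streaming_output | sed -n l
```

If the assumption holds, the trailing tab is the separator written between the whole line (treated as the key) and an empty value, which would explain why neither cat nor cut as the mapper removes it.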
