
I have CSV files in LZO format in HDFS. I would like to load these files into S3 and then into Snowflake. Since Snowflake does not support LZO compression for the CSV file format, I need to convert the files on the fly while loading them to S3.
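A minimal sketch of the kind of on-the-fly conversion being asked about, assuming the hadoop-lzo codec is registered so that hadoop fs -text can decompress the files, and using placeholder HDFS and S3 paths:

# Decompress one LZO file from HDFS, re-compress as gzip, and stream it to S3
# without writing an intermediate local copy (paths and bucket are placeholders).
hadoop fs -text /data/input/part-00000.lzo \
  | gzip \
  | aws s3 cp - s3://my-bucket/snowflake-staging/part-00000.csv.gz

Each file would need its own pipe (or a loop over the HDFS directory listing), but nothing is staged on local disk.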

Vishrant
  • If you're using s3distcp, you can specify the output compression codec (see the sketch after these comments): https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html – mazaneicha May 21 '20 at 12:58
  • @mazaneicha Thanks for the response. Can I use s3distcp outside of EMR? – Vishrant May 21 '20 at 14:43
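If the copy can run on EMR, the s3-dist-cp approach from the comment above might look roughly like this; the source and destination paths are placeholders, and --outputCodec accepts gz, gzip, lzo, snappy, or none:

# Re-compress while copying from HDFS to S3 using EMR's s3-dist-cp (placeholder paths)
s3-dist-cp \
  --src hdfs:///data/lzo-csv/ \
  --dest s3://my-bucket/snowflake-staging/ \
  --outputCodec=gz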

2 Answers


You can consider using a Lambda function that decompresses the files upon landing on S3; here is a link that gets you there:

https://medium.com/@johnpaulhayes/how-extract-a-huge-zip-file-in-an-amazon-s3-bucket-by-using-aws-lambda-and-python-e32c6cf58f06

Rich Murnane
  • Sorry, this is not an option for my use case; as this is part of a pipeline, I can't add another step for Lambda. Thanks for the suggestion though. – Vishrant May 20 '20 at 23:58

This answer helped me convert from .lzo_deflate to the required Snowflake-compatible output format:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input <input-path> \
  -output $OUTPUT \
  -mapper "cut -f 2"
Vishrant