
I have CSV files in LZO format in HDFS. I would like to load these files into S3 and then into Snowflake. Since Snowflake does not support LZO compression for the CSV file format, I need to convert the files on the fly while loading them to S3.
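A minimal sketch of the kind of on-the-fly conversion being asked about, assuming the hadoop-lzo codec is registered so that hadoop fs -text can decompress the files, and using placeholder HDFS and S3 paths:

# Decompress one LZO file from HDFS, re-compress as gzip, and stream it to S3
# without writing an intermediate local copy (paths and bucket are placeholders).
hadoop fs -text /data/input/part-00000.lzo \
  | gzip \
  | aws s3 cp - s3://my-bucket/snowflake-staging/part-00000.csv.gz

Each file would need its own pipe (or a loop over the HDFS directory listing), but nothing is staged on local disk.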

Vishrant
  • If you're using s3distcp, you can specify the output compression codec (see the sketch after these comments): https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html – mazaneicha May 21 '20 at 12:58
  • @mazaneicha Thanks for the response. Can I use s3distcp outside of EMR? – Vishrant May 21 '20 at 14:43
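If the copy can run on EMR, the s3-dist-cp approach from the comment above might look roughly like this; the source and destination paths are placeholders, and --outputCodec accepts gz, gzip, lzo, snappy, or none:

# Re-compress while copying from HDFS to S3 using EMR's s3-dist-cp (placeholder paths)
s3-dist-cp \
  --src hdfs:///data/lzo-csv/ \
  --dest s3://my-bucket/snowflake-staging/ \
  --outputCodec=gz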

2 Answers


You can consider using a Lambda function that decompresses the files upon landing on S3; here is a link that gets you there:

https://medium.com/@johnpaulhayes/how-extract-a-huge-zip-file-in-an-amazon-s3-bucket-by-using-aws-lambda-and-python-e32c6cf58f06

Rich Murnane
  • Sorry, this is not an option for my use case; as this is part of a pipeline, I can't add another step for Lambda. Thanks for the suggestion though. – Vishrant May 20 '20 at 23:58

This answer helped me convert from .lzo_deflate to the required Snowflake-compatible output format:

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
  -Dmapred.output.compress=true \
  -Dmapred.compress.map.output=true \
  -Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
  -Dmapred.reduce.tasks=0 \
  -input <input-path> \
  -output $OUTPUT \
  -mapper "cut -f 2"
Vishrant