I have CSV files in LZO format in HDFS. I would like to load these files into S3 and then into Snowflake. Since Snowflake does not provide LZO compression for the CSV file format, I need to convert the files on the fly while loading them into S3.
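For context, the Snowflake side of the load would look roughly like the sketch below (the stage, table, and file format names are placeholders); Snowflake's CSV file format accepts GZIP, which is why converting LZO to gzip on the way to S3 works:

-- names are placeholders; the stage points at the S3 location the files land in
CREATE FILE FORMAT my_csv_gzip
  TYPE = CSV
  COMPRESSION = GZIP;

COPY INTO my_table
  FROM @my_s3_stage/csv-gz/
  FILE_FORMAT = (FORMAT_NAME = 'my_csv_gzip');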

Vishrant
- If you're using s3distcp, you can specify the output compression codec: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/UsingEMR_s3distcp.html – mazaneicha May 21 '20 at 12:58
- @mazaneicha thanks for the response, can I use s3distcp outside of EMR? – Vishrant May 21 '20 at 14:43
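For reference, the s3distcp route suggested above would look roughly like this on an EMR cluster (the paths and bucket are placeholders; per the AWS docs, --outputCodec accepts gzip, gz, lzo, snappy, or none):

s3-dist-cp --src hdfs:///data/csv-lzo/ --dest s3://my-bucket/csv-gz/ --outputCodec=gz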
2 Answers
You can consider using a Lambda function that decompresses the files upon landing on S3; here is a link that gets you there:
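A minimal sketch of that pattern, assuming an S3 ObjectCreated trigger filtered to the .lzo suffix (so the .gz upload does not re-invoke the function) and an lzop binary available to the runtime, for example via a Lambda layer; bucket prefixes and key names are placeholders:

import gzip
import os
import subprocess
import tempfile

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # e.g. landing/part-00000.csv.lzo
        local_lzo = os.path.join(tempfile.gettempdir(), os.path.basename(key))
        s3.download_file(bucket, key, local_lzo)
        # lzop is assumed to be on PATH (e.g. shipped in a Lambda layer);
        # -d decompresses in place, writing the file without its .lzo suffix
        subprocess.run(["lzop", "-d", "-f", local_lzo], check=True)
        plain = local_lzo[: -len(".lzo")]
        # recompress as gzip, which Snowflake's CSV file format accepts
        with open(plain, "rb") as src, gzip.open(plain + ".gz", "wb") as dst:
            dst.writelines(src)
        out_key = key[: -len(".lzo")] + ".gz"  # e.g. landing/part-00000.csv.gz
        s3.upload_file(plain + ".gz", bucket, out_key)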

Rich Murnane
- sorry, this is not an option for my use case, as this is part of a pipeline; can't add another step for Lambda. Thanks for the suggestion though. – Vishrant May 20 '20 at 23:58
This answer helped me convert from .lzo_deflate to the required Snowflake-compatible output format:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
-Dmapred.output.compress=true \
-Dmapred.compress.map.output=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec \
-Dmapred.reduce.tasks=0 \
-input <input-path> \
-output $OUTPUT \
-mapper "cut -f 2"

Vishrant