33

I have a 17.7GB file on S3. It was generated as the output of a Hive query, and it isn't compressed.

I know that gzip-compressing it would bring it down to about 2.2GB. How can I download this file locally as quickly as possible when transfer is the bottleneck (250kB/s)?

I've not found any straightforward way to compress the file on S3, or enable compression on transfer in s3cmd, boto, or related tools.

Matt Joiner
  • 112,946
  • 110
  • 377
  • 526
  • Do you have the ability to regenerate this file by rerunning your Hive query? If so, I would advise enabling output compression for your Hive query. – Charles Menguy Jan 24 '13 at 06:28
  • @CharlesMenguy: I actually did this the first time (I think). However there was an `order by` in the statement, and this affected the output. Normally I'd get a file for each map job, but instead I got a single file from the reduce which I assume is where the ordering was done. – Matt Joiner Jan 24 '13 at 06:30
  • How did you enable output compression in your query? I think you should be able to compress the output of pretty much any Hive query regardless of whether there is an `order by` or not. I assume you're writing to S3 by doing an `insert overwrite directory 's3n://...'`, right? – Charles Menguy Jan 24 '13 at 06:34
  • http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-output-compression.html – Barbaros Alp May 31 '16 at 13:42
  • Also see [Serving compressed files using CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html) – djvg Jul 22 '22 at 07:07
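
If the file can be regenerated, the approach suggested in the comments above is to have Hive write gzip-compressed output straight to S3. A minimal sketch, assuming a Hive setup of that era and made-up bucket, table, and column names (newer Hadoop versions use the `mapreduce.output.fileoutputformat.compress.*` properties instead of the `mapred.*` ones):

hive -e "
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- hypothetical query: writes gzip-compressed part files to the S3 directory
INSERT OVERWRITE DIRECTORY 's3n://my-bucket/hive-output/'
SELECT * FROM my_table ORDER BY my_column;
"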

3 Answers

30

S3 does not support stream compression, nor is it possible to compress the uploaded file remotely.

If this is a one-time process, I suggest downloading it to an EC2 machine in the same region, compressing it there, and then uploading it to your destination.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html
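
A rough sketch of that one-time process, run on an EC2 instance in the same region as the bucket (the bucket and file names here are made up):

# on the instance: pull the object over the fast intra-region link
aws s3 cp s3://my-bucket/hive-output/part-00000 bigfile.tsv

# compress it locally; this is where the 17.7GB should shrink to roughly 2.2GB
gzip bigfile.tsv

# push the compressed copy back, then fetch bigfile.tsv.gz over the slow 250kB/s link
aws s3 cp bigfile.tsv.gz s3://my-bucket/hive-output/bigfile.tsv.gz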

If you need to do this more frequently, see:

Serving gzipped CSS and JavaScript from Amazon CloudFront via S3
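
The gist of that approach is to upload an already-gzipped copy and set the Content-Encoding header so clients receive the compressed bytes. A minimal sketch with made-up names, using the standard `--content-encoding` and `--content-type` options of `aws s3 cp`:

gzip -9 style.css
aws s3 cp style.css.gz s3://my-bucket/style.css --content-encoding gzip --content-type text/css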

Community
  • 1
  • 1
Michel Feldheim
  • 17,625
  • 5
  • 60
  • 77
14

Late answer, but I found this works perfectly.

aws s3 sync s3://your-pics .

# gzip each .jpg in place, keeping its original name so the S3 keys stay the same
find . -name "*.jpg" | while read -r file; do gzip "$file" && mv "$file.gz" "$file"; echo "$file"; done

aws s3 sync . s3://your-pics --content-encoding gzip --dryrun

This downloads all the files in the S3 bucket to the machine (or EC2 instance), compresses the image files, and uploads them back to the S3 bucket. Verify the data before removing the `--dryrun` flag.
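
Assuming the objects were re-uploaded gzip-compressed as above, pulling one back over a slow link stays small because the stored bytes are already gzip. For example, with a hypothetical key name:

aws s3 cp s3://your-pics/photo.jpg - | gunzip > photo.jpg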

sj26
  • 6,725
  • 2
  • 27
  • 24
Navaneeth Pk
  • 602
  • 7
  • 14
3

There are now pre-built apps in Lambda that you can use to compress images and files in S3 buckets. Just create a new Lambda function, select a pre-built app of your choice, and complete the configuration.

  1. Create a new Lambda function.
  2. Search for a pre-built app.
  3. Select the app that suits your needs and complete the configuration by providing the S3 bucket names.
CloudArch
  • 291
  • 2
  • 3