33

I have a 17.7GB file on S3. It was generated as the output of a Hive query, and it isn't compressed.

I know that gzip-compressing it would bring it down to about 2.2GB. How can I download this file locally as quickly as possible when transfer is the bottleneck (250kB/s)?

I've not found any straightforward way to compress the file on S3, or enable compression on transfer in s3cmd, boto, or related tools.

Matt Joiner
  • 112,946
  • 110
  • 377
  • 526
  • Do you have the ability to regenerate this file by rerunning your Hive query? If so, I would advise enabling output compression for your Hive query. – Charles Menguy Jan 24 '13 at 06:28
  • @CharlesMenguy: I actually did this the first time (I think). However there was an `order by` in the statement, and this affected the output. Normally I'd get a file for each map job, but instead I got a single file from the reduce which I assume is where the ordering was done. – Matt Joiner Jan 24 '13 at 06:30
  • How did you enable output compression in your query? I think you should be able to compress the output of pretty much any Hive query regardless of whether there is an `order by` or not. I assume you're writing to S3 by doing an `insert overwrite directory 's3n://...'`, right? – Charles Menguy Jan 24 '13 at 06:34
  • http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-output-compression.html – Barbaros Alp May 31 '16 at 13:42
  • Also see [Serving compressed files using CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/ServingCompressedFiles.html) – djvg Jul 22 '22 at 07:07
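
If the file can be regenerated, the approach suggested in the comments above is to have Hive write gzip-compressed output straight to S3. A minimal sketch, assuming a Hive setup of that era and made-up bucket, table, and column names (newer Hadoop versions use the `mapreduce.output.fileoutputformat.compress.*` properties instead of the `mapred.*` ones):

hive -e "
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
-- hypothetical query: writes gzip-compressed part files to the S3 directory
INSERT OVERWRITE DIRECTORY 's3n://my-bucket/hive-output/'
SELECT * FROM my_table ORDER BY my_column;
"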

3 Answers

30

S3 does not support stream compression, nor is it possible to compress the uploaded file remotely.

If this is a one-time process, I suggest downloading it to an EC2 machine in the same region, compressing it there, and then uploading it to your destination.

http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EC2_GetStarted.html
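
A rough sketch of that one-time process, run on an EC2 instance in the same region as the bucket (the bucket and file names here are made up):

# on the instance: pull the object over the fast intra-region link
aws s3 cp s3://my-bucket/hive-output/part-00000 bigfile.tsv

# compress it locally; this is where the 17.7GB should shrink to roughly 2.2GB
gzip bigfile.tsv

# push the compressed copy back, then fetch bigfile.tsv.gz over the slow 250kB/s link
aws s3 cp bigfile.tsv.gz s3://my-bucket/hive-output/bigfile.tsv.gz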

If you need to do this more frequently, see:

Serving gzipped CSS and JavaScript from Amazon CloudFront via S3
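
The gist of that approach is to upload an already-gzipped copy and set the Content-Encoding header so clients receive the compressed bytes. A minimal sketch with made-up names, using the standard `--content-encoding` and `--content-type` options of `aws s3 cp`:

gzip -9 style.css
aws s3 cp style.css.gz s3://my-bucket/style.css --content-encoding gzip --content-type text/css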

Community
  • 1
  • 1
Michel Feldheim
  • 17,625
  • 5
  • 60
  • 77
14

Late answer, but I found this works perfectly.

aws s3 sync s3://your-pics .

# gzip each .jpg in place, keeping its original name so the S3 keys stay the same
find . -name "*.jpg" | while read -r file; do gzip "$file" && mv "$file.gz" "$file"; echo "$file"; done

aws s3 sync . s3://your-pics --content-encoding gzip --dryrun

This downloads all the files in the S3 bucket to the machine (or EC2 instance), compresses the image files, and uploads them back to the S3 bucket. Verify the data before removing the `--dryrun` flag.
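
Assuming the objects were re-uploaded gzip-compressed as above, pulling one back over a slow link stays small because the stored bytes are already gzip. For example, with a hypothetical key name:

aws s3 cp s3://your-pics/photo.jpg - | gunzip > photo.jpg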

sj26
  • 6,725
  • 2
  • 27
  • 24
Navaneeth Pk
  • 602
  • 7
  • 14
3

There are now pre-built apps in Lambda that you can use to compress images and files in S3 buckets. Just create a new Lambda function, select a pre-built app of your choice, and complete the configuration.

  1. Create a new Lambda function.
  2. Search for a pre-built app.
  3. Select the app that suits your needs and complete the configuration by providing the S3 bucket names.
CloudArch
  • 291
  • 2
  • 3