13

I have zip files uploaded to S3. I'd like to download them for processing. I don't need to permanently store them, but I need to temporarily process them. How would I go about doing this?

user1802143
  • 14,662
  • 17
  • 46
  • 55
  • If you want to simply download it without extracting any file, it's also possible to use the `download_file` method as shown in this answer: https://stackoverflow.com/a/71474927/11764049 – Aelius Mar 24 '23 at 09:48
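
For completeness, a minimal sketch of the `download_file` approach that comment mentions, assuming boto3 is configured and using placeholder bucket/key/file names:

import boto3

# Simply download the object to a local path without extracting it.
# "my-bucket", "archive.zip" and the local filename are placeholders.
s3 = boto3.client("s3")
s3.download_file("my-bucket", "archive.zip", "/tmp/archive.zip")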

6 Answers

31

Because working software > comprehensive documentation:

Boto2

import zipfile
import boto
import io

# Connect to s3
# This will need your s3 credentials to be set up 
# with `aws configure` using the aws CLI.
#
# See: https://aws.amazon.com/cli/
conn = boto.connect_s3()

# get hold of the bucket
bucket = conn.get_bucket("my_bucket_name")

# Get hold of a given file
key = boto.s3.key.Key(bucket)
key.key = "my_s3_object_key"

# Create an in-memory bytes IO buffer
with io.BytesIO() as b:

    # Read the file into it
    key.get_file(b)

    # Reset the file pointer to the beginning
    b.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(b, mode='r') as zipf:
        for subfile in zipf.namelist():
            do_stuff_with_subfile()

Boto3

import zipfile
import boto3
import io

# this is just to demo. real use should use the config 
# environment variables or config file.
#
# See: http://boto3.readthedocs.org/en/latest/guide/configuration.html

session = boto3.session.Session(
    aws_access_key_id="ACCESSKEY", 
    aws_secret_access_key="SECRETKEY"
)

s3 = session.resource("s3")
bucket = s3.Bucket('stackoverflow-brice-test')
obj = bucket.Object('smsspamcollection.zip')

with io.BytesIO(obj.get()["Body"].read()) as tf:

    # rewind the file
    tf.seek(0)

    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)

Tested on macOS with Python 3.

brice
  • 24,329
  • 7
  • 79
  • 95
  • Thank you for your answer. Do you know any way to do this on boto3? – jaycode Mar 20 '16 at 18:26
  • @brice getting 'no such file or directory' when i try to actually `with open(subfile, 'r') as file:` – partydog Mar 18 '19 at 18:28
  • I do not believe this method will work with very large (~>2GB) zip files. You will get a "Python int too large to convert to C long" when you try to read the zip file with the line "with io.BytesIO(obj.get()["body].read()) as tf:" I have been unable to find a reliable way to open a S3 zip file that is larger than 2GB. – Doug Bower Apr 08 '21 at 06:23
  • @partydog that is because it is just printing the names of the files within the zip file. – Binx Apr 26 '22 at 20:02
  • How can we read this in pandas? I tried adding the subfile as a parameter but it throws the following error - FileNotFoundError: [Errno 2] No such file or directory: – Mohseen Mulla Dec 20 '22 at 08:55
  • Found the answer to my own question - If it helps anyone with zipfile.ZipFile(tf, mode='r') as zipf: for line in zipf.read("xyz.csv").split(b"\n"): print(line) – Mohseen Mulla Dec 20 '22 at 09:15
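
Regarding the comment above about archives larger than ~2GB: a hedged workaround is to spool the object to a temporary file on disk with boto3's `download_fileobj`, so the whole archive never has to sit in a single in-memory bytes object (bucket and key names below are placeholders):

import tempfile
import zipfile

import boto3

s3 = boto3.client("s3")

# Spool the object to a temporary file on disk so the whole archive
# never has to fit in memory at once.
with tempfile.TemporaryFile() as tmp:
    s3.download_fileobj("my-bucket", "big-archive.zip", tmp)
    tmp.seek(0)
    with zipfile.ZipFile(tmp, mode="r") as zipf:
        for subfile in zipf.namelist():
            print(subfile)
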
4

Pandas provides a shortcut for this, which removes most of the code from the top answer and lets you stay agnostic about whether your file path is on S3, GCS, or your local machine.

import io
import zipfile

import pandas as pd

obj = pd.io.parsers.get_filepath_or_buffer(file_path)[0]
with io.BytesIO(obj.read()) as byte_stream:
    # Use your byte stream, to, for example, print file names...
    with zipfile.ZipFile(byte_stream, mode='r') as zipf:
        for subfile in zipf.namelist():
            print(subfile)
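
Note that `get_filepath_or_buffer` is a pandas-internal helper and may be unavailable in newer pandas releases. A hedged alternative in the same path-agnostic spirit, assuming `fsspec` and `s3fs` are installed (the s3:// URL is a placeholder):

import io
import zipfile

import fsspec

# fsspec resolves local, s3://, gs:// etc. paths to a file-like object.
with fsspec.open("s3://my-bucket/archive.zip", mode="rb") as f:
    with zipfile.ZipFile(io.BytesIO(f.read()), mode="r") as zipf:
        for subfile in zipf.namelist():
            print(subfile)
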
Teddy Ward
  • 450
  • 3
  • 16
3

If speed is a concern, a good approach would be to choose an EC2 instance fairly close to your S3 bucket (in the same region) and use that instance to unzip/process your zipped files.

This reduces latency and lets you process the files fairly efficiently. You can remove each extracted file after finishing your work.

Note: This will only work if you are fine using EC2 instances.
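
A minimal sketch of what that workflow could look like on the instance, assuming boto3 is configured there; the bucket/key names and `process()` are placeholders:

import os
import tempfile
import zipfile

import boto3

s3 = boto3.client("s3")

# Download to the instance's local disk, extract, process, then clean up.
with tempfile.TemporaryDirectory() as workdir:
    archive_path = os.path.join(workdir, "archive.zip")
    s3.download_file("my-bucket", "archive.zip", archive_path)

    with zipfile.ZipFile(archive_path, mode="r") as zipf:
        zipf.extractall(workdir)

    for name in os.listdir(workdir):
        if name != "archive.zip":
            process(os.path.join(workdir, name))  # placeholder processing function

# The temporary directory (and every extracted file) is removed on exit.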

DanGar
  • 3,018
  • 17
  • 17
1

You have probably heard of boto, which is the Python interface to Amazon Web Services.

You can download a key from S3 to a local file.

import os
import boto
from zipfile import ZipFile

s3 = boto.connect_s3() # connect
bucket = s3.get_bucket(bucket_name) # get bucket
key = bucket.get_key(key_name) # get key (the file in s3)
key.get_contents_to_filename(local_name) # download to a temporary local path

with ZipFile(local_name, 'r') as myzip:
    # do something with myzip, e.g. list the members
    print(myzip.namelist())

os.unlink(local_name) # delete the temporary file

You can also use tempfile. For more detail, see create & read from tempfile
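
A hedged sketch of that tempfile variant, staying with this answer's boto (boto2) style and placeholder names:

import tempfile
import zipfile

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket("my-bucket")   # placeholder bucket name
key = bucket.get_key("archive.zip")     # placeholder key name

# NamedTemporaryFile is deleted automatically when the block exits,
# so there is no need to call os.unlink() yourself.
with tempfile.NamedTemporaryFile() as tmp:
    key.get_contents_to_file(tmp)
    tmp.seek(0)
    with zipfile.ZipFile(tmp, mode='r') as myzip:
        print(myzip.namelist())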

emesday
  • 6,078
  • 3
  • 29
  • 46
1

Reading a specific file from a zip file in an S3 bucket.

import boto3
import os
import zipfile
import io
import json


'''
When you configure awscli, you'll set up a credentials file located at 
~/.aws/credentials. By default, this file will be used by Boto3 to authenticate.
'''
os.environ['AWS_PROFILE'] = "<profile_name>"
os.environ['AWS_DEFAULT_REGION'] = "<region_name>"

# Let's use Amazon S3
s3_name = "<bucket_name>"
zip_file_name = "<zip_file_name>"
file_to_open = "<file_to_open>"
s3 = boto3.resource('s3')
obj = s3.Object(s3_name, zip_file_name)

with io.BytesIO(obj.get()["Body"].read()) as tf:
    # rewind the file
    tf.seek(0)
    # Read the file as a zipfile and process the members
    with zipfile.ZipFile(tf, mode='r') as zipf:
        file_contents = zipf.read(file_to_open).decode("utf-8")
        print(file_contents)

Adapted from @brice's answer.

nirojshrestha019
  • 2,068
  • 1
  • 10
  • 14
0

Adding on to @brice's answer


Here is the code if you want to read the data inside a file in the archive line by line:

with zipfile.ZipFile(tf, mode='r') as zipf:
    for line in zipf.read("xyz.csv").split(b"\n"):
        print(line)
        break # to break off after the first line
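
If the member is large, a hedged alternative is to stream it with `zipf.open()` instead of reading the whole member into memory (this continues from the same `tf` buffer as above; "xyz.csv" is the same example file name):

import csv
import io

with zipfile.ZipFile(tf, mode='r') as zipf:
    # zipf.open() returns a binary file object for the member;
    # io.TextIOWrapper decodes it so csv.reader can parse it row by row.
    with zipf.open("xyz.csv") as member:
        for row in csv.reader(io.TextIOWrapper(member, encoding="utf-8")):
            print(row)
            break  # remove this to process every row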

Hope this helps!

Mohseen Mulla
  • 542
  • 7
  • 15