119

I read the filenames in my S3 bucket by doing

import boto3

s3 = boto3.client('s3')
objs = s3.list_objects(Bucket='my_bucket')
if 'Contents' in objs:
    for obj in objs['Contents']:
        filename = obj['Key']

Now I need to get the actual content of the file, similar to open(filename).readlines(). What is the best way?

mar tin

10 Answers

148

boto3 offers a resource model that makes tasks like iterating through objects easier. Unfortunately, StreamingBody doesn't provide readline or readlines.

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('test-bucket')
# Iterates through all the objects, doing the pagination for you. Each obj
# is an ObjectSummary, so it doesn't contain the body. You'll need to call
# get to get the whole body.
for obj in bucket.objects.all():
    key = obj.key
    body = obj.get()['Body'].read()
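
If you do need line-by-line access, one workaround is to read the whole body and split it client-side. A minimal sketch, assuming a UTF-8 text object and placeholder bucket/key names:

import boto3

s3 = boto3.resource('s3')
obj = s3.Object('test-bucket', 'my-key.txt')  # placeholder bucket and key

# Read the full object into memory, then split it into lines.
for line in obj.get()['Body'].read().decode('utf-8').splitlines():
    print(line)
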
Jordon Phillips
  • I passed through the client because I need to configure it manually within the script itself, as in client = boto3.client('s3', aws_access_key_id="***", aws_secret_access_key="****"). Is there a way to give the access keys to the resource without using the client? – mar tin Mar 24 '16 at 19:33
  • You can configure the resource in the same way. – Jordon Phillips Mar 24 '16 at 20:01
  • How do I read a file if it is inside folders in S3? For example, my bucket is named A, A has a folder B, B has a folder C, and C contains a file Readme.csv. How do I read this file? Your solution works if the files are directly in the bucket, but what about nested folders? Thanks. – Kshitij Marwah Dec 14 '16 at 16:56
  • S3 is an object store, not a file system. It doesn't actually have the concept of folders, though one is commonly stapled on. When iterating over objects you will get everything unless you specify otherwise. – Jordon Phillips Dec 14 '16 at 17:06
  • We can get the body, but how can I read line by line within this body? – Gabriel Wu Mar 02 '17 at 06:29
  • @GabrielWu did you find a way? – Adi Apr 01 '17 at 19:04
  • Do I really need to get all the objects until I find the one I need? – Iulian Onofrei May 02 '17 at 20:22
  • @IulianOnofrei you can filter in your listing, and there is no requirement that you call get on the object if you don't need to. – Jordon Phillips May 02 '17 at 22:36
  • Isn't `bucket.objects.all()` making requests to AWS while iterating it? – Iulian Onofrei May 03 '17 at 07:22
  • @IulianOnofrei it is making requests, yes, but you aren't downloading the objects, just listing them. You can use `.filter()` to make fewer list requests. Or, if you know the key you want, just get it directly with `bucket.Object('mykey')`. – Jordon Phillips May 03 '17 at 15:40
  • The `bucket.Object('mykey')` was exactly what I needed, thanks! – Iulian Onofrei May 03 '17 at 18:27
  • @martin you can configure the AWS CLI or put both keys in the ~/.aws/credentials file; that way you don't have to specify them in the script. They will be picked up from the machine or environment. – Vivek Feb 18 '18 at 14:18
  • @JordonPhillips would you know how to use your code to actually put all the files read into data frames? I mean, if I have a bucket with two CSV files, how do I combine them into one? – Kalenji Nov 04 '20 at 09:15
48

Using the client instead of the resource:

import boto3

s3 = boto3.client('s3')
bucket = 'bucket_name'
result = s3.list_objects(Bucket=bucket, Prefix='/something/')
for o in result.get('Contents', []):  # default to [] so an empty listing doesn't raise
    data = s3.get_object(Bucket=bucket, Key=o.get('Key'))
    contents = data['Body'].read()
    print(contents.decode("utf-8"))
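
Note that list_objects returns at most 1000 keys per call. For larger buckets, a paginator follows the continuation tokens for you; a minimal sketch, with placeholder bucket and prefix names:

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

# Each page holds up to 1000 keys; the paginator requests the next page as needed.
for page in paginator.paginate(Bucket='bucket_name', Prefix='something/'):
    for o in page.get('Contents', []):
        body = s3.get_object(Bucket='bucket_name', Key=o['Key'])['Body'].read()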
Ryan M
Climbs_lika_Spyder
41

You might consider the smart_open module, which supports iterators:

from smart_open import smart_open

# stream lines from an S3 object
for line in smart_open('s3://mybucket/mykey.txt', 'rb'):
    print(line.decode('utf8'))

and context managers:

with smart_open('s3://mybucket/mykey.txt', 'rb') as s3_source:
    for line in s3_source:
        print(line.decode('utf8'))

    s3_source.seek(0)  # seek to the beginning
    b1000 = s3_source.read(1000)  # read 1000 bytes

Find smart_open at https://pypi.org/project/smart_open/
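
Newer releases of smart_open deprecate the smart_open() entry point in favor of an open() function; a minimal sketch under that assumption, with the same placeholder bucket and key:

from smart_open import open

# Stream lines from an S3 object using the newer API.
with open('s3://mybucket/mykey.txt', 'rb') as s3_source:
    for line in s3_source:
        print(line.decode('utf8'))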

caffreyd
24

When you want to read a file with a configuration different from the default one, feel free to use either mpu.aws.s3_read(s3path) directly or the code below, copied from it:

import boto3
import mpu.aws


def s3_read(source, profile_name=None):
    """
    Read a file from an S3 source.

    Parameters
    ----------
    source : str
        Path starting with s3://, e.g. 's3://bucket-name/key/foo.bar'
    profile_name : str, optional
        AWS profile

    Returns
    -------
    content : bytes

    Raises
    ------
    botocore.exceptions.NoCredentialsError
        Botocore is not able to find your credentials. Either specify
        profile_name or add the environment variables AWS_ACCESS_KEY_ID,
        AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN.
        See https://boto3.readthedocs.io/en/latest/guide/configuration.html
    """
    session = boto3.Session(profile_name=profile_name)
    s3 = session.client('s3')
    bucket_name, key = mpu.aws._s3_path_split(source)
    s3_object = s3.get_object(Bucket=bucket_name, Key=key)
    body = s3_object['Body']
    return body.read()
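
For example, a call might look like this (the bucket path and profile name are placeholders):

content = s3_read('s3://bucket-name/key/foo.bar', profile_name='dev')
print(content.decode('utf-8'))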
Martin Thoma
18

If you already know the filename, you can use the boto3 built-in download_fileobj:

import boto3

from io import BytesIO

session = boto3.Session()
s3_client = session.client("s3")

f = BytesIO()
s3_client.download_fileobj("bucket_name", "filename", f)
print(f.getvalue())
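
If you want line-by-line access on top of this, you can rewind the buffer and iterate it; a small sketch continuing from the code above:

f.seek(0)  # read-style access starts at the current position, so rewind first
for line in f:
    print(line.decode('utf-8'))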
reubano
  • `f.seek(0)` is unnecessary with a BytesIO (or StringIO) object. `read` starts at the current position, but `getvalue` always reads from position 0. – Adam Hoelscher May 26 '20 at 20:01
  • Good point @adam. There's the chance that someone will actually need `read` for their use case. I only used `getvalue` for demonstrative purposes. – reubano May 26 '20 at 20:16
4
import boto3

print("started")

s3 = boto3.resource('s3',
                    region_name='region_name',
                    aws_access_key_id='your_access_id',
                    aws_secret_access_key='your_access_key')

obj = s3.Object('bucket_name', 'file_name')
data = obj.get()['Body'].read()
print(data)
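
The body comes back as bytes; if the object holds text, you will usually want to decode it, for example:

print(data.decode('utf-8'))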
1

This is correct, tested code to access file contents in an S3 bucket using boto3; it works as of the date of posting. Note that registering the disable_signing handler makes the requests unsigned (anonymous), so this works only for publicly readable buckets.

import boto3
from botocore.handlers import disable_signing


def get_file_contents(bucket, prefix):
    s3 = boto3.resource('s3')
    # Send unsigned (anonymous) requests; only works for public buckets.
    s3.meta.client.meta.events.register('choose-signer.s3.*', disable_signing)
    bucket = s3.Bucket(bucket)
    for obj in bucket.objects.filter(Prefix=prefix):
        key = obj.key
        body = obj.get()['Body'].read()
        print(body)
        return body  # returns the body of the first matching object


get_file_contents('coderbytechallengesandbox', '__cb__')
bilalmohib
0

The best way for me is this:

import json
from decimal import Decimal

import boto3

s3 = boto3.client('s3')
table = boto3.resource('dynamodb').Table('table_name')  # placeholder table
s3_bucket = 'bucket_name'  # placeholder bucket
s3_key = 'prefix/'         # placeholder prefix

result = s3.list_objects(Bucket=s3_bucket, Prefix=s3_key)
for file in result.get('Contents', []):
    data = s3.get_object(Bucket=s3_bucket, Key=file.get('Key'))
    contents = data['Body'].read()
    # Float types are not supported by DynamoDB; use Decimal types instead
    j = json.loads(contents, parse_float=Decimal)
    for item in j:
        timestamp = item['timestamp']

        table.put_item(
            Item={
                'timestamp': timestamp
            }
        )

Once you have the content, you can run it through another loop to write it to a DynamoDB table, for instance ...

aerioeus
0

An alternative to boto3 in this particular case is s3fs.

from s3fs import S3FileSystem

s3 = S3FileSystem()
bucket = 'your-bucket'

def read_file(path):
    # path looks like 'your-bucket/file.txt'; s3fs accepts it with or without s3://
    with s3.open(path, 'r') as file:
        return file.readlines()

for path in s3.ls(bucket):  # list the objects in the bucket
    lines = read_file(path)
    ...
Manuel Montoya
0

Please note that Boto3 has stopped updating Resources, and the recommended approach now is to go back to using the Client.

So, I believe the answer from @Climbs_lika_Spyder should now be the accepted answer.

Reference: https://boto3.amazonaws.com/v1/documentation/api/latest/guide/resources.html

Warning: The AWS Python SDK team is no longer planning to support the resources interface in boto3. Requests for new changes involving resource models will no longer be considered, and the resources interface won't be supported in the next major version of the AWS SDK for Python. The AWS SDK teams are striving to achieve more consistent functionality among SDKs, and implementing customized abstractions in individual SDKs is not a sustainable solution going forward. Future feature requests will need to be considered at the cross-SDK level.

Eric Aya