
I have S3 access only to a specific directory in an S3 bucket.

For example, with the s3cmd command if I try to list the whole bucket:

    $ s3cmd ls s3://bucket-name

I get an error: Access to bucket 'my-bucket-url' was denied

But if I try to access a specific directory in the bucket, I can see the contents:

    $ s3cmd ls s3://bucket-name/dir-in-bucket

Now I want to connect to the S3 bucket with python boto. Similarly, with:

    bucket = conn.get_bucket('bucket-name')

I get an error: boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden

But if I try:

    bucket = conn.get_bucket('bucket-name/dir-in-bucket')

The script stalls for about 10 seconds and then prints an error. Below is the full trace. Any idea how to proceed with this?

Note: this question is about the boto version 2 module, not boto3.

    Traceback (most recent call last):
      File "test_s3.py", line 7, in <module>
        bucket = conn.get_bucket('bucket-name/dir-name')
      File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 471, in get_bucket
        return self.head_bucket(bucket_name, headers=headers)
      File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 490, in head_bucket
        response = self.make_request('HEAD', bucket_name, headers=headers)
      File "/usr/local/lib/python2.7/dist-packages/boto/s3/connection.py", line 633, in make_request
        retry_handler=retry_handler
      File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 1046, in make_request
        retry_handler=retry_handler)
      File "/usr/local/lib/python2.7/dist-packages/boto/connection.py", line 922, in _mexe
        request.body, request.headers)
      File "/usr/lib/python2.7/httplib.py", line 958, in request
        self._send_request(method, url, body, headers)
      File "/usr/lib/python2.7/httplib.py", line 992, in _send_request
        self.endheaders(body)
      File "/usr/lib/python2.7/httplib.py", line 954, in endheaders
        self._send_output(message_body)
      File "/usr/lib/python2.7/httplib.py", line 814, in _send_output
        self.send(msg)
      File "/usr/lib/python2.7/httplib.py", line 776, in send
        self.connect()
      File "/usr/lib/python2.7/httplib.py", line 1157, in connect
        self.timeout, self.source_address)
      File "/usr/lib/python2.7/socket.py", line 553, in create_connection
        for res in getaddrinfo(host, port, 0, SOCK_STREAM):
    socket.gaierror: [Errno -2] Name or service not known
Martin Taleski

8 Answers


For boto3

    import boto3

    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket('my_bucket_name')

    for object_summary in my_bucket.objects.filter(Prefix="dir_name/"):
        print(object_summary.key)
M.Vanderlee
  • You have to use parentheses around object_summary.key for this to work in python3: print(object_summary.key) – Peycho Dimitrov Jul 26 '18 at 07:26
  • the weirdest part is Prefix="dir_name" worked fine on my linux machine, but to run this on lambda it's important to use Prefix="dir_name/"; I couldn't quite figure out why the forward slash would be significant on lambda. – yash Mar 09 '22 at 22:44
  • How to limit the search depth? – Gulzar Dec 25 '22 at 17:00
  • maybe due to OS differences @yash – Gunesh Shanbhag Dec 25 '22 at 17:57
  • @Gulzar if you know exactly what you want, don't paginate (it's expensive); you could use further str filters once you get all the file names, without retrieving any data. – yash Dec 28 '22 at 12:03
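On @Gulzar's question about limiting the search depth: server-side you can pass Delimiter='/' so S3 folds deeper keys into CommonPrefixes; client-side you can filter the returned keys by counting slashes. A minimal client-side sketch (plain Python, no AWS call; keys_at_depth is a hypothetical helper, not part of boto3):

```python
def keys_at_depth(keys, prefix, max_depth=1):
    """Keep only keys at most max_depth levels below prefix."""
    kept = []
    for key in keys:
        rest = key[len(prefix):].strip('/')
        if rest and rest.count('/') < max_depth:
            kept.append(key)
    return kept
```

With max_depth=1 this keeps only the immediate children of prefix, roughly mimicking what Delimiter='/' gives you server-side.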

By default, when you do a get_bucket call in boto it tries to validate that you actually have access to that bucket by performing a HEAD request on the bucket URL. In this case, you don't want boto to do that since you don't have access to the bucket itself. So, do this:

    bucket = conn.get_bucket('my-bucket-url', validate=False)

and then you should be able to do something like this to list objects:

    for key in bucket.list(prefix='dir-in-bucket'):
        print(key.name)  # or process the key

If you still get a 403 error, try adding a slash at the end of the prefix:

    for key in bucket.list(prefix='dir-in-bucket/'):
        print(key.name)  # or process the key

Note: this answer was written about the boto version 2 module, which is obsolete by now. At the moment (2020), boto3 is the standard module for working with AWS. See this question for more info: What is the difference between the AWS boto and boto3

garnaat
  • thanks, this worked for me, I just needed to add a slash ('/') at the end of the bucket name, otherwise I still got the 403 error. – Martin Taleski Dec 04 '14 at 13:04
  • Yes, that makes sense. I approved your edit to my example. Glad it's working for you. – garnaat Dec 04 '14 at 13:18
  • Why is the trailing "/" needed? I can confirm that it is required in my instance, but I couldn't find documentation of it. – dbn Dec 13 '16 at 00:34

Boto3 client:

    import boto3

    _BUCKET_NAME = 'mybucket'
    _PREFIX = 'subfolder/'

    client = boto3.client('s3', aws_access_key_id=ACCESS_KEY,
                          aws_secret_access_key=SECRET_KEY)

    def ListFiles(client):
        """List files in specific S3 URL"""
        response = client.list_objects(Bucket=_BUCKET_NAME, Prefix=_PREFIX)
        for content in response.get('Contents', []):
            yield content.get('Key')

    file_list = ListFiles(client)
    for file in file_list:
        print('File found: %s' % file)

Using a session:

    from boto3.session import Session

    _BUCKET_NAME = 'mybucket'
    _PREFIX = 'subfolder/'

    session = Session(aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY)

    client = session.client('s3')

    def ListFilesV1(client, bucket, prefix=''):
        """List files in specific S3 URL"""
        paginator = client.get_paginator('list_objects')
        for result in paginator.paginate(Bucket=bucket, Prefix=prefix,
                                         Delimiter='/'):
            for content in result.get('Contents', []):
                yield content.get('Key')

    file_list = ListFilesV1(client, _BUCKET_NAME, prefix=_PREFIX)
    for file in file_list:
        print('File found: %s' % file)
gogasca
  • In general: what is the difference between the boto3.resource, boto3.client, and boto3.session based approaches, and which should be followed under what conditions? – v.j Feb 25 '19 at 10:51
  • For any confusion, boto3.resource is preferred. Also here is the difference between client and resource: https://stackoverflow.com/questions/42809096/difference-in-boto3-between-resource-client-and-session – Nagaraj Tantri Jul 12 '21 at 12:06

I just had this same problem, and this code does the trick.

    import boto3

    s3 = boto3.resource("s3")
    s3_bucket = s3.Bucket("bucket-name")
    dir = "dir-in-bucket"
    files_in_s3 = [f.key.split(dir + "/")[1]
                   for f in s3_bucket.objects.filter(Prefix=dir).all()]
rob
  • this answer involves boto3; the original question was about the boto version 2 module. Nevertheless, by 2020 boto3 is the standard. – Martin Taleski Jul 25 '20 at 15:02
  • What does .all() mean exactly? According to the documentation [link](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Bucket.objects), `all() Creates an iterable of all Bucket resources in the collection.`, but is it ever used with filter()? How to get only the folder in a prefix (without subfolders)? Thank you a lot @rob – iD_Sgh Nov 24 '22 at 18:18

The following code lists all the files in a specific directory of the S3 bucket:

    import boto3

    s3 = boto3.client('s3')

    def get_all_s3_keys(s3_path):
        """
        Get a list of all keys in an S3 bucket.

        :param s3_path: Path of S3 dir.
        """
        keys = []

        if not s3_path.startswith('s3://'):
            s3_path = 's3://' + s3_path

        bucket = s3_path.split('//')[1].split('/')[0]
        prefix = '/'.join(s3_path.split('//')[1].split('/')[1:])

        kwargs = {'Bucket': bucket, 'Prefix': prefix}
        while True:
            resp = s3.list_objects_v2(**kwargs)
            # Use .get() so an empty listing (no 'Contents' key) doesn't raise
            for obj in resp.get('Contents', []):
                keys.append(obj['Key'])

            try:
                kwargs['ContinuationToken'] = resp['NextContinuationToken']
            except KeyError:
                break

        return keys
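The manual bucket/prefix split above can also be written with the standard library's urlparse; a small sketch (the helper name split_s3_path is made up for illustration):

```python
from urllib.parse import urlparse

def split_s3_path(s3_path):
    """Split 's3://bucket/some/prefix' (or 'bucket/some/prefix') into (bucket, prefix)."""
    if not s3_path.startswith('s3://'):
        s3_path = 's3://' + s3_path
    parsed = urlparse(s3_path)
    return parsed.netloc, parsed.path.lstrip('/')
```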
Nandeesh

This can be done using:

    import boto3

    s3_client = boto3.client('s3')
    objects = s3_client.list_objects_v2(Bucket='bucket_name')
    for obj in objects['Contents']:
        print(obj['Key'])
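Note that a single list_objects_v2 call returns at most 1000 keys, so larger folders need a loop over NextContinuationToken (or a paginator). The loop shape, sketched against a stand-in function instead of the real client so it stays self-contained (drain_listing is a hypothetical helper; in real code you would pass client.list_objects_v2 as list_page):

```python
def drain_listing(list_page, bucket, prefix):
    """Collect keys across pages; list_page stands in for client.list_objects_v2."""
    keys, token = [], None
    while True:
        kwargs = {'Bucket': bucket, 'Prefix': prefix}
        if token:
            kwargs['ContinuationToken'] = token
        page = list_page(**kwargs)
        keys += [obj['Key'] for obj in page.get('Contents', [])]
        token = page.get('NextContinuationToken')
        if not token:
            return keys
```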
KayV

The simplest way to list objects of a specific prefix in S3 is to use awswrangler:

    import awswrangler as wr

    wr.s3.list_objects("s3://bucket_name/some/prefix/")

This returns a list of the objects stored under some/prefix/.

HagaiA

If you want to list all the objects of a folder in your bucket, you can specify it while listing.

    import boto

    conn = boto.connect_s3(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    bucket = conn.get_bucket(AWS_BUCKET_NAME)
    for file in bucket.list("FOLDER_NAME/", "/"):
        print(file.name)  # or process the key
reetesh11