
I have an S3 bucket with files under a folder structure like folder1/folder2. I want to list only the files under that folder structure and iterate through them in a SageMaker Jupyter notebook.

How can I achieve this? I tried the instructions in Listing contents of a bucket with boto3, but I was only able to list recursively from the top level of the bucket. I want to list only at the folder level.

I also tried the code snippet below

import boto3
s3 = boto3.resource('s3')
my_bucket = s3.Bucket('bucketname/folder1/folder2')
for my_bucket_object in my_bucket.objects.all():
    print(my_bucket_object)

and got the below error

ParamValidationError: Parameter validation failed:
Invalid bucket name...

Using Python 3.9 currently. Thanks!

John Rotenstein
FlyingPickle

1 Answer

A few issues here:

  1. The bucket name is just bucketname; it cannot include a path.
  2. folder1/folder2/ is the key prefix, not part of the bucket name.
  3. You need to filter the listing by that prefix rather than retrieving all objects.

Try:

import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('bucketname')
for object_summary in bucket.objects.filter(Prefix='folder1/folder2/'):
    print(object_summary)

That will result in a list of ObjectSummary values being printed, for example:

s3.ObjectSummary(bucket_name='bucketname', key='folder1/folder2/')
s3.ObjectSummary(bucket_name='bucketname', key='folder1/folder2/abc.csv')
s3.ObjectSummary(bucket_name='bucketname', key='folder1/folder2/def.csv')
s3.ObjectSummary(bucket_name='bucketname', key='folder1/folder2/xyz.png')
s3.ObjectSummary(bucket_name='bucketname', key='folder1/folder2/folder3/')

Note that it will include all objects under the folder1/folder2/ prefix, regardless of their file extension, and it may also include a marker for the folder itself (folder1/folder2/) and for any logical sub-folders such as folder1/folder2/folder3/.

You can retrieve the underlying Object from each ObjectSummary as follows (note that the key is also available directly as object_summary.key, without the extra Object() call):

for object_summary in bucket.objects.filter(Prefix="folder1/folder2/"):
    print(object_summary.Object().key)

That will result in a list of Object keys being printed, for example:

folder1/folder2/
folder1/folder2/abc.csv
folder1/folder2/def.csv
folder1/folder2/xyz.png
folder1/folder2/folder3/

You can filter these to get just CSVs, as needed, for example:

summaries = bucket.objects.filter(Prefix="folder1/folder2/")
csvs = [x for x in summaries if x.key.endswith(".csv")]

for object_summary in csvs:
    print(object_summary.key)

That will result in:

folder1/folder2/abc.csv
folder1/folder2/def.csv

And you can split out the actual filename, as follows:

for object_summary in csvs:
    print(object_summary.key.split("/")[-1])

That will result in:

abc.csv
def.csv
jarmod
  • Thank you @jarmod...I would presume this would print the file names in the directory...but what I get is an empty output...the folder is full of CSV files...any thoughts on how I can check this further? – FlyingPickle Apr 05 '23 at 04:52
  • @FlyingPickle I've updated the answer with more details of the ObjectSummary and Object values. Can you comment here if you're still seeing a problem. Also, double-check your objects using the awscli e.g. `aws s3 ls s3://bucketname/folder1/folder2/`. – jarmod Apr 05 '23 at 12:05