8

I am trying to get a list of parquet file paths from S3 that are inside subdirectories, and subdirectories of subdirectories (and so on and so forth).

If it was my local file system I would do this:

import glob 

glob.glob('C:/Users/user/info/**/*.parquet', recursive=True)

I have tried using the glob method of s3fs; however, it doesn't have a recursive kwarg.

Is there a function I can use, or do I need to implement it myself?


3 Answers

11

You can use s3fs with glob:

import s3fs

s3 = s3fs.S3FileSystem(anon=False)

s3.glob('your/s3/path/here/*.parquet')
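Whether `glob` matches recursively depends on the s3fs/fsspec version, as noted in the comments. A version-independent sketch is to use `find`, which lists every object under a prefix regardless of depth, and filter the keys yourself; the path below is a placeholder, and the suffix filter is split out so it can be checked without AWS access:

```python
def filter_parquet(keys):
    # Keep only parquet objects from a flat, recursive key listing.
    return [k for k in keys if k.endswith('.parquet')]

def list_parquet_keys(path):
    # find() walks every object under the prefix recursively,
    # no matter how deeply "nested" the keys appear to be.
    # Requires AWS credentials to actually run.
    import s3fs
    s3 = s3fs.S3FileSystem(anon=False)
    return filter_parquet(s3.find(path))
```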
  • `s3fs` is a good option until you need recursive search, as I indicated in my question – moshevi Mar 20 '22 at 13:32
6

I also wanted to download the latest file from an S3 bucket, but located in a specific folder. Initially I tried using glob, but couldn't find a solution to this problem. Finally, I built the following function to solve it. You can modify this function to work with subfolders.

This function returns a dictionary of all file names and timestamps as key-value pairs

(key: file_name, value: timestamp).

Just pass the bucket name and the prefix (which is the folder name).

import boto3

def get_file_names(bucket_name, prefix):
    """
    Return all file names and timestamps in an S3 bucket folder.

    :param bucket_name: Name of the S3 bucket.
    :param prefix: Only keep keys that start with this prefix (folder name).
    """
    s3_client = boto3.client('s3')
    objs = s3_client.list_objects_v2(Bucket=bucket_name)['Contents']
    shortlisted_files = dict()
    for obj in objs:
        key = obj['Key']
        timestamp = obj['LastModified']
        # If the key starts with the folder name, keep that key
        if key.startswith(prefix):
            # Add a new key-value pair
            shortlisted_files[key] = timestamp
    return shortlisted_files

file_names = get_file_names(bucket_name='use_your_bucket_name', prefix='folder_name/')
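Since the returned dictionary maps each key to its LastModified timestamp, picking the latest file is a max() over the values. A small sketch with made-up keys and datetimes standing in for the real result:

```python
from datetime import datetime, timezone

# Hypothetical output of get_file_names(): key -> LastModified
shortlisted_files = {
    'folder_name/a.parquet': datetime(2021, 1, 1, tzinfo=timezone.utc),
    'folder_name/b.parquet': datetime(2021, 6, 1, tzinfo=timezone.utc),
    'folder_name/c.parquet': datetime(2021, 3, 1, tzinfo=timezone.utc),
}

# The key with the newest timestamp is the latest file.
latest_filename = max(shortlisted_files, key=shortlisted_files.get)
# latest_filename == 'folder_name/b.parquet'
```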
1

S3 doesn't actually have subdirectories, per se.

boto3's S3.Client.list_objects_v2() supports a Prefix argument, which should get you all the objects in a given "directory" in a bucket, no matter how "deep" they appear to be.
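A sketch of that approach, with placeholder bucket and prefix names: a paginator handles listings beyond the 1,000-key page limit, and the suffix filter is a separate pure function so it can be checked without AWS access.

```python
def parquet_keys_from_listing(contents, suffix='.parquet'):
    # contents is the 'Contents' list from a list_objects_v2 response:
    # dicts with at least a 'Key' entry.
    return [obj['Key'] for obj in contents if obj['Key'].endswith(suffix)]

def list_parquet_objects(bucket, prefix):
    # Placeholder bucket/prefix; requires AWS credentials to actually run.
    import boto3
    s3 = boto3.client('s3')
    keys = []
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matches omit 'Contents' entirely.
        keys += parquet_keys_from_listing(page.get('Contents', []))
    return keys
```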
