I have a bucket in Amazon S3 called test-bucket. Within this bucket, the JSON files are organized like this:
test-bucket
| continent
| country
| <filename>.json
Essentially, the filenames follow the pattern continent/country/<filename>.json. Within each country there are about 100k files, each containing a single dictionary, like this:
{"data":"more data", "even more data":"more data", "other data":"other other data"}
Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:
import json

def append_to_file(data, filename):
    # Append one JSON dictionary per line (newline-delimited JSON)
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")
However, I do not know all the filenames in advance (each name is a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, and then append them to a single file whose name is the country?
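For what it's worth, I assume listing the keys under a prefix would look something like this boto3 paginator sketch (the bucket name and prefix are just examples from my setup, and I have not tested this against 100k keys):

import boto3

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    # Yield every object key under the given prefix (paginates past 1000 keys)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# e.g. every key that starts with "Asia/China/"
for key in list_keys("test-bucket", "Asia/China/"):
    print(key)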
Ideally, I would prefer not to download all the files to local storage. If I could load these files into memory, that would be great.
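To clarify what I mean, the untested sketch below is roughly what I have in mind: read each object's body into memory, concatenate the dictionaries one per line, and put the result back with put_object. The output key Asia/China.json is just an example:

import io
import boto3

s3 = boto3.client("s3")

def combine_country(bucket, prefix, out_key):
    # Read every object under the prefix into memory, one JSON dict per line,
    # then upload the combined file back to the same bucket.
    buffer = io.StringIO()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            buffer.write(body.decode("utf-8").strip())
            buffer.write("\n")
    s3.put_object(Bucket=bucket, Key=out_key, Body=buffer.getvalue().encode("utf-8"))

combine_country("test-bucket", "Asia/China/", "Asia/China.json")

I am not sure whether building the whole country file in a single in-memory buffer is sensible at this scale, which is part of what I am asking.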
EDIT: To make things more clear: files in S3 aren't actually stored in folders; the key paths are just set up to look like folders. All files are stored directly under test-bucket.