I have a bucket in Amazon S3 called test-bucket. Within this bucket, the JSON files are organized like this:
test-bucket
| continent
| country
| <filename>.json
Essentially, the filenames follow the pattern continent/country/<filename>.json. Within each country there are about 100k files, each containing a single dictionary, like this:
{"data":"more data", "even more data":"more data", "other data":"other other data"}
Different files have different lengths. What I need to do is compile all these files together into a single file, then re-upload that file into s3. The easy solution would be to download all the files with boto3, read them into Python, then append them using this script:
import json

def append_to_file(data, filename):
    # Append one JSON dictionary per line (newline-delimited JSON)
    with open(filename, "a") as f:
        json.dump(data, f)
        f.write("\n")
However, I do not know all the filenames in advance (each name is a timestamp). How can I read all the files in a folder, e.g. Asia/China/*, and then append them to a single file whose name is the country?
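For what it's worth, I assume listing the keys under a prefix would look something like this boto3 paginator sketch (the bucket name and prefix are just examples from my setup, and I have not tested this against 100k keys):

import boto3

s3 = boto3.client("s3")

def list_keys(bucket, prefix):
    # Yield every object key under the given prefix (paginates past 1000 keys)
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

# e.g. every key that starts with "Asia/China/"
for key in list_keys("test-bucket", "Asia/China/"):
    print(key)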
Ideally, I would prefer not to download all the files to local storage. If I could load these files into memory, that would be great.
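To clarify what I mean, the untested sketch below is roughly what I have in mind: read each object's body into memory, concatenate the dictionaries one per line, and put the result back with put_object. The output key Asia/China.json is just an example:

import io
import boto3

s3 = boto3.client("s3")

def combine_country(bucket, prefix, out_key):
    # Read every object under the prefix into memory, one JSON dict per line,
    # then upload the combined file back to the same bucket.
    buffer = io.StringIO()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
            buffer.write(body.decode("utf-8").strip())
            buffer.write("\n")
    s3.put_object(Bucket=bucket, Key=out_key, Body=buffer.getvalue().encode("utf-8"))

combine_country("test-bucket", "Asia/China/", "Asia/China.json")

I am not sure whether building the whole country file in a single in-memory buffer is sensible at this scale, which is part of what I am asking.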
EDIT: To make things more clear: files in S3 aren't actually stored in folders; the key paths are just set up to look like folders. All files are stored directly under test-bucket.