The following is the object structure in my S3 bucket:
s3://bucket/
    open-images/
        apple/
            images/
                file112.jpg
                ...
            pascal/
                file112.xml
                ...
Objective
Convert the XML files to JSON and put each converted file under json/, so that the object structure in the S3 bucket looks like:
s3://bucket/
    open-images/
        apple/
            images/
                file112.jpg
                ...
            pascal/
                file112.xml
                ...
            json/
                file112.json
                ...
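In other words, each key only swaps its pascal/ folder for json/ and its .xml extension for .json. A minimal sketch of that mapping is below. The helper name xml_key_to_json_key is hypothetical (it is not part of my code further down), and it assumes every XML key sits directly inside a folder named pascal.

```python
import posixpath


def xml_key_to_json_key(xml_key: str) -> str:
    """Map an S3 key under .../pascal/ to the matching key under .../json/.

    Hypothetical helper: assumes the key ends in ".xml" and that its
    parent folder is named "pascal".
    """
    # S3 keys always use "/" as separator, so posixpath is used instead of os.path.
    folder, file_name = posixpath.split(xml_key)
    parent, leaf = posixpath.split(folder)
    if leaf != "pascal" or not file_name.endswith(".xml"):
        raise ValueError("not a pascal XML key: {}".format(xml_key))
    stem = file_name[: -len(".xml")]
    return posixpath.join(parent, "json", stem + ".json")


print(xml_key_to_json_key("open-images/apple/pascal/file112.xml"))
# open-images/apple/json/file112.json
```

Keys outside a pascal/ folder (for example the .jpg files under images/) raise a ValueError rather than being silently renamed.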
My Approach
    import json
    import os
    import shutil

    import boto3
    import xmltodict

    # <bucket_name> is a placeholder for the real bucket name
    bucket = boto3.resource("s3").Bucket("<bucket_name>")

    # Delimiter='jpg' rolls keys containing "jpg" (after the prefix) into
    # common prefixes, so the image objects are not returned by the listing
    for obj in bucket.objects.filter(Prefix="open-images/", Delimiter='jpg'):
        if obj.key.endswith(".xml"):
            # generating destination path for storing json files in the SageMaker instance
            xml_file_name = obj.key
            start, end = xml_file_name.split("pascal")
            dest_path = start + "json" + end
            # converting xml to json
            xml_file = obj.get()['Body']
            data_dict = xmltodict.parse(xml_file.read())
            xml_file.close()
            json_data = json.dumps(data_dict)
            # storing json file under the destination path in the SageMaker instance
            os.makedirs(start + "json", exist_ok=True)
            # the with block closes the file, so no explicit close() is needed
            with open("{}.json".format(dest_path[:-4]), "w") as json_file:
                json_file.write(json_data)
            # copying the json file to s3
            os.system('aws s3 cp --recursive "./open-images/" "s3://<bucket_name>/open-images/"')
            # deleting json file from the SageMaker instance to avoid a memory error
            shutil.rmtree("open-images/{}/".format(start[12:]))
Question
Is there a better way to do this?