The following is the object structure in my S3 bucket:

s3://bucket/
    open-images/
        apple/
            images/
                file112.jpg
                ...
            pascal/
                file112.xml
                ...

Objective

Convert each XML file to JSON and store the result under json/, so that the object structure in the S3 bucket looks like this:

s3://bucket/
    open-images/
        apple/
            images/
                file112.jpg
                ...
            pascal/
                file112.xml
                ...
            json/
                file112.json
                ...

My Approach

for obj in bucket.objects.filter(Prefix="open-images/", Delimiter='jpg'):
    if "xml" in obj.key:

        # generating destination path for storing json files on the SageMaker instance
        xml_file_name = obj.key
        start,end = xml_file_name.split("pascal")
        dest_path = start+"json"+end
        
        # converting xml to json
        xml_file = obj.get()['Body']
        data_dict = xmltodict.parse(xml_file.read())
        xml_file.close()
        json_data = json.dumps(data_dict)
        
        # storing json file under the destination path on the SageMaker instance
        os.makedirs(start+"json", exist_ok=True)
        with open("{}.json".format(dest_path[:-4]), "w") as json_file:
            json_file.write(json_data)
        # copying the json file to s3
        os.system('aws s3 cp --recursive "./open-images/" "s3://<bucket_name>/open-images/"')
        # deleting json file from the SageMaker instance to avoid memory error
        shutil.rmtree("open-images/{}/".format(start[12:]))

Question

Is there a better way to do this?

Tomalak
iamarchisha
  • Wouldn't it make sense to write the JSON files back to the S3 bucket immediately, instead of collecting them on your local file system first? – Tomalak May 10 '21 at 10:15
  • Yeah it would and that's where I need help. I was not able to do that. What I have done is very naive and inefficient. – iamarchisha May 10 '21 at 13:08
  • https://stackoverflow.com/questions/40336918/how-to-write-a-file-or-data-to-an-s3-object-using-boto3 – Tomalak May 10 '21 at 13:12
  • Are you talking about something like the suggested approach I have added to the question? In the suggested approach aren't we still first writing on the local then putting to S3? – iamarchisha May 10 '21 at 15:00
  • That's a code sample that should get you started with writing files directly to S3. There is no need to write the data to the local file system unless you need the files locally *as well*. – Tomalak May 10 '21 at 15:11
  • Oh okay! Got it. Thanks for the suggestion. – iamarchisha May 10 '21 at 23:17
  • Still not quite. You're still writing a local file. That's completely unnecessary. Create the JSON in memory (as a string) and store it in S3. It's only two lines: `object = s3.Object(bucket_name, f"{dest_path[:-4]}.json")` and `object.put(Body=json.dumps(data_dict))`. – Tomalak May 11 '21 at 06:55
  • The above lines of code do not create a 'json' object in the S3 bucket and write the JSON to it. The code executes without errors but does not give the desired result. – iamarchisha May 12 '21 at 05:19
  • [The answer I took this from](https://stackoverflow.com/a/54272479/18771) comes directly from the thread I linked to above, it has 50+ upvotes, and none of the comments below it says that it doesn't work. Given that this is not the only answer in that thread that does it that way, it must work pretty much exactly like that. That's all I can say, I don't have an S3 account to test with unfortunately. Take some time and read through all of the answers in the other thread (and if needed a couple of other threads on the topic). – Tomalak May 12 '21 at 07:12
  • Okay. I will go through it. Thank you so much for your time @Tomalak – iamarchisha May 12 '21 at 10:12
  • You'll figure it out, I'm sure. It's practically impossible that the code runs without errors and yet creates nothing in the S3 Bucket. There must be something simple that you overlook. I'd start with checking that the target file path is what you expect it to be. Maybe it creates the files, but somewhere else? – Tomalak May 12 '21 at 10:24
  • Maybe you should start with enabling debug logging in boto3: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/core/boto3.html, see the `set_stream_logger` method. This way you can at least see what happens and don't have to guess. – Tomalak May 12 '21 at 10:29
  • It was a very silly thing I did. It was creating the file but with some other name. `f"dest_path[:-4].json"` isn't needed. Just `dest_path[:-4].json` works. I am new to StackOverflow so just wanted to know if it is okay to post answer based on someone's reply in comments? – iamarchisha May 12 '21 at 10:36
  • That's why I wrote `f"{dest_path[:-4]}.json"`, not `f"dest_path[:-4].json"`. There is a difference - one has curly braces, the other has not. See [Formatted String Literals](https://docs.python.org/3/tutorial/inputoutput.html#tut-f-strings). – Tomalak May 12 '21 at 10:41
  • ...and yes, it's perfectly fine (even encouraged) to answer your own questions on Stack Overflow. – Tomalak May 12 '21 at 10:42
  • I was wondering about those braces. But thanks for bringing that to light – iamarchisha May 12 '21 at 10:45
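
The difference between the two f-string spellings discussed in the comments above can be checked without touching S3. The snippet below is only an illustration; the value of `dest_path` is an assumed example key based on the layout in the question.

```python
# Assumed example value, modeled on the bucket layout in the question.
dest_path = "open-images/apple/json/file112.xml"

# With curly braces, the expression inside them is evaluated.
with_braces = f"{dest_path[:-4]}.json"

# Without curly braces, the f-string contains only literal text.
without_braces = f"dest_path[:-4].json"

print(with_braces)     # open-images/apple/json/file112.json
print(without_braces)  # dest_path[:-4].json
```

The second form would explain an object being created in the bucket under an unexpected name: the key is the literal text rather than the computed path.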

1 Answer

A better approach, as suggested by @Tomalak, is to write the JSON directly to S3 objects instead of writing the files locally and then copying them to S3. The final, faster code looks like this:

import json
import boto3
import xmltodict

# initiate the s3 resource
s3 = boto3.resource('s3')
# select the bucket
bucket_name = "<bucket_name>"
bucket = s3.Bucket(bucket_name)

for obj in bucket.objects.filter(Prefix="<key>", Delimiter='jpg'):
    
    if "xml" in obj.key:
        # generating final destination path
        xml_file_name = obj.key
        start,end = xml_file_name.split("pascal")
        dest_path = start+"json"+end
        
        # converting xml to json
        xml_file = obj.get()['Body']
        data_dict = xmltodict.parse(xml_file.read())
        xml_file.close()
        json_data = json.dumps(data_dict)

        # writing json file to s3
        object = s3.Object(bucket_name, dest_path[:-4]+'.json')
        object.put(Body=json.dumps(data_dict))
iamarchisha
  • The line `json_data = json.dumps(data_dict)` is no longer necessary here. Also, `object` is a so-called [built-in identifier](https://stackoverflow.com/a/22864250/18771) in Python, similar to `str` or `float`. As a general "code hygiene" rule, avoid using such words for variable names. Python lets you do it, but it will only lead to problems at some point. – Tomalak May 12 '21 at 10:49
  • A clean way to avoid it in this case would be not to use a variable name at all: `s3.Object(bucket_name, dest_path[:-4]+'.json').put(Body=json.dumps(data_dict))`. – Tomalak May 12 '21 at 11:15
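
One detail the code above leaves fragile is the key rewriting: `xml_file_name.split("pascal")` raises a `ValueError` on unpacking if "pascal" occurs more than once in the key, and `[:-4]` assumes the key ends in a four-character extension. A minimal sketch of a stricter mapping follows. The helper name `json_key_for` is hypothetical and not part of the original answer, and it assumes keys shaped like `<prefix>/pascal/<name>.xml` as in the question.

```python
from pathlib import PurePosixPath


def json_key_for(xml_key: str) -> str:
    """Map '<prefix>/pascal/<name>.xml' to '<prefix>/json/<name>.json'.

    Hypothetical helper: it assumes the XML object sits directly
    inside a folder named 'pascal', as in the question's layout.
    """
    path = PurePosixPath(xml_key)
    if path.suffix.lower() != ".xml" or path.parent.name != "pascal":
        raise ValueError(f"not a pascal XML key: {xml_key!r}")
    # Swap the parent folder name and the file extension.
    return str(path.parent.with_name("json") / path.with_suffix(".json").name)


print(json_key_for("open-images/apple/pascal/file112.xml"))
# open-images/apple/json/file112.json
```

The result could be passed as the key to `s3.Object(bucket_name, ...)` in place of `dest_path[:-4]+'.json'`; keys that are not XML files under a `pascal/` folder raise an error instead of being rewritten silently.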