
I have a set of 3 JSON files which have an identical layout. The number will be much larger once I push the code into production; I only use 3 to keep the workflow fast.

The JSON files are structured like this:

{
    "results": [
        {
            "engagement": {
                "id": 2342,
                "portalId": 23423,
                "active": true,
                "createdAt": 1661855667536,
                "lastUpdated": 1661935264761,
                "modifiedBy": 3453
            },
            "associations": {
                "contactIds": [
                    00000
                ],
                "companyIds": [],
                "dealIds": []
            },
            "attachments": [],
            "scheduledTasks": [],
            "metadata": {
                "status": "COMPLETED",
                "forObjectType": "CONTACT",
                "subject": "DEMO"
            }
        }
    ],
    "hasMore": true,
    "offset": 520,
    "total": 10523
}

There can be up to 250 records in the "results" array, each starting with "engagement".

I am trying to find a way to merge all 3 JSON files with Python, keeping only the data in "results" and dropping the rest.

So far I am able to either add all 3 JSONs together, but they stay separated under their different "results" headers, or the last JSON overwrites the previously written file, so I am no further.

The expected results would look like this:

  [
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
],
[
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
],
[
    {
        "engagement": {
            "id": 2342,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                00000
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
],

Any help will be much appreciated.

2 Answers


This is relatively simple, but I would restructure the resulting JSON a bit, as the current structure does not make much sense.

The code below simply loads each file and adds all the list elements from its "results" entry to the final_result list. You end up with a single list where each element contains the part you needed from the original JSON files.

Then it saves that to a new file.

import json

filelist = ["file1.json", "file2.json", "file3.json"]
final_result = []


for filename in filelist:
    with open(filename) as infile:
        newdata = json.load(infile)
        final_result.extend(newdata["results"])

with open("result.json", "w") as outfile:
    json.dump(final_result, outfile, indent=4)

result.json

[
    {
        "engagement": {
            "id": 1,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
    {
        "engagement": {
            "id": 2,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    },
    {
        "engagement": {
            "id": 3,
            "portalId": 23423,
            "active": true,
            "createdAt": 1661855667536,
            "lastUpdated": 1661935264761,
            "modifiedBy": 3453
        },
        "associations": {
            "contactIds": [
                21345
            ],
            "companyIds": [],
            "dealIds": []
        },
        "attachments": [],
        "scheduledTasks": [],
        "metadata": {
            "status": "COMPLETED",
            "forObjectType": "CONTACT",
            "subject": "DEMO"
        }
    }
]

For getting files from a directory I have this function. It takes a file path and optionally a file extension, and returns a list of filenames which you can use with the code above. If you need files from multiple directories, you can simply extend the list of filenames as shown below.

import os


def get_files_from_path(path: str = ".", ext=None) -> list:
    """Find files in path and return them as a list.

    Walks all folders and subfolders. See the answer on the link
    below for a ridiculously complete answer for this.
    https://stackoverflow.com/a/41447012/9267296

    Args:
        path (str, optional): Which path to start on.
                              Defaults to '.'.
        ext (str/list, optional): Optional file extension(s).
                                  Defaults to None.
    Returns:
        list: list of full file paths
    """
    result = []
    for subdir, dirs, files in os.walk(path):
        for fname in files:
            filepath = os.path.join(subdir, fname)
            if ext is None:
                result.append(filepath)
            elif isinstance(ext, str) and fname.lower().endswith(ext.lower()):
                result.append(filepath)
            elif isinstance(ext, list):
                for item in ext:
                    if fname.lower().endswith(item.lower()):
                        result.append(filepath)
    return result


filelist = get_files_from_path("your/path/here/", ext=".json")
filelist.extend(get_files_from_path("another/path/here/", ext=".json"))
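If you prefer the standard library's glob module, roughly the same list can be built in one call. This is a sketch using the same placeholder paths as above:

```python
import glob


def list_json_files(path):
    """Recursively list all .json files under path (glob-based sketch)."""
    # "**" with recursive=True descends into subfolders, like os.walk does
    return glob.glob(f"{path}/**/*.json", recursive=True)


filelist = list_json_files("your/path/here/")
filelist.extend(list_json_files("another/path/here/"))
```

Unlike the function above, this matches only one extension per call, but that is all the merge script needs.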
Edo Akse
  • Man, thanks for your quick response! One question, as I've run into this before. If the JSON files are not in the same directory, how do you give this code the right path to where the JSON files are stored? – Opper_Draak Sep 01 '22 at 14:08
  • I did just move the JSON files to the same directory just to see if it got the right results. However, it seems to only pick up the first instance of "results" where the JSON files could hold over 200 of those clusters. How would we go about that? – Opper_Draak Sep 01 '22 at 14:16
  • 2
    In addition, you can also list/use all json files inside a directory ```json_files = [jfile for jfile in os.listdir(path) if jfile.endswith('.json')]``` – Ashkan Sep 01 '22 at 14:32
  • Thanks, the code now looks like `path = '/content/extracted_data/' json_files = [jfile for jfile in os.listdir(path) if jfile.endswith('.json')] final_result = [] for filename in json_files: with open(path+filename) as infile: newdata = json.load(infile) # grab the first list element from the results dict newdata = newdata ["results"][0] final_result.append(newdata) with open("result.json", "w") as outfile: json.dump(final_result, outfile, indent=4)` But still having the issue with the number of results... – Opper_Draak Sep 01 '22 at 14:44
  • OK, so if the `results` element of the JSON files contains more than just a single result you can use `list.extend(other_list)` to add all the elements of `results` to the `final_results`. See my updated answer. As for finding files in directories, I have a function in my github repo I re-use for this scenario. I'll put that in the updated answer as well. – Edo Akse Sep 02 '22 at 10:23
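The append vs extend difference from the last comment, as a toy illustration with hypothetical per-file results:

```python
# append() adds the whole list as a single element, which reproduces the
# "still separated by results headers" problem; extend() adds its items.
file1_results = [{"id": 1}, {"id": 2}]
file2_results = [{"id": 3}]

nested = []
nested.append(file1_results)
nested.append(file2_results)
# nested == [[{"id": 1}, {"id": 2}], [{"id": 3}]]

flat = []
flat.extend(file1_results)
flat.extend(file2_results)
# flat == [{"id": 1}, {"id": 2}, {"id": 3}]
```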

After help from @Edo Akse I found a solution to the problem. The final code looks like this:

import json
import os

path = '/content/extracted_data/'
json_files = [jfile for jfile in os.listdir(path) if jfile.endswith('.json')]
final_result = []


for filename in json_files:
    with open(os.path.join(path, filename)) as infile:
        newdata = json.load(infile)
        # add every record from this file's "results" list
        final_result.extend(newdata["results"])

with open("result.json", "w") as outfile:
    json.dump(final_result, outfile, indent=4)

The result is that all the records from the JSON files are in a single list of dicts, and from there we can easily put them in a dataframe using pd.json_normalize.
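As a quick sketch of that last step (assuming pandas is installed), json_normalize flattens the nested dicts into dotted column names; the sample record below is a cut-down version of the merged shape above:

```python
import pandas as pd

# A small sample in the merged shape produced by the script above
final_result = [
    {
        "engagement": {"id": 2342, "portalId": 23423, "active": True},
        "metadata": {"status": "COMPLETED", "subject": "DEMO"},
    }
]

df = pd.json_normalize(final_result)
# Columns: engagement.id, engagement.portalId, engagement.active,
#          metadata.status, metadata.subject
```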

Thanks for all your help guys!