
I'm trying to write code that opens about 900 nested dictionaries (very large JSON files with roughly 99% similar content) and stores the value of each key in a list named after the key. For example, given two dictionaries {"data": 37} and {"data": 74}, I would want to combine those two values into a list named data containing [37, 74].

Here is the code I'm currently using to accomplish this:

import json
import sys

import pandas as pd
df = pd.read_csv("/Users/---.csv")
i=True
def sort(d):
    # walk the nested dict; store each leaf value in a module-level list
    # named after its key
    for k, v in d.items():
        if isinstance(v, dict):
            sort(v)
        else:
            global i
            if i==True:
                print("{0} : {1}".format(k, v))
                setattr(sys.modules[__name__], k, [v])
                i=False
            else:
                print("{0} : {1}".format(k, v))
                globals()["{}".format(k)].append(v)

            
for i in df['file_num']:
    with open("/Users/--/allDAFs{}.json".format(i)) as f:
        data=json.load(f)
    sort(data)

The problem with this is twofold: (a) there are duplicates and I'm not sure why; some keys end up with 1400 values when there are only 900 files. (b) I can't link the values to a file_num. As you can see, I'm iterating over file_num, and I'd like to link each value to the file_num it came from.

I know I may not be doing this the best way possible so any insight or advice would be greatly appreciated.

EDIT: This is how I need the end result to look, preferably in a pandas DataFrame:

[Image: the desired result, a DataFrame with one row per file_num and one column per key]

  • Have you seen this answer? https://stackoverflow.com/a/40589031/6692898 – RichieV Aug 05 '20 at 16:39
  • @RichieV I tried to normalize each JSON file but I'm getting an error. (str has no attribute values). I'm not even sure how I would do this for 800 files and combine it into one DataFrame. – Matthew Kaplan Aug 05 '20 at 17:04
  • If you have duplicates of the same key in one file (by means of having nested dictionaries) then the problem lies in your data source and is not python related – dnalow Aug 06 '20 at 14:07

2 Answers


I would use a defaultdict. Maybe I'm missing something, but I don't really see the problem.

import collections
import json

import pandas as pd


def sort(d, output):
    # recursively walk nested dicts, appending each leaf value to output[key]
    for k, v in d.items():
        if isinstance(v, dict):
            sort(v, output)
        else:
            output[k].append(v)


df = pd.read_csv("/Users/---.csv")
results = collections.defaultdict(list)

for i in df['file_num']:
    with open("/Users/--/allDAFs{}.json".format(i)) as f:
        data = json.load(f)
    sort(data, results)

Then, results will be a dictionary of lists that you can address by the same keys.
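
For the example in the question, a quick usage sketch:

print(results['data'])       # e.g. [37, 74]
print(len(results['data']))  # equals the file count if 'data' appears once per file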

For your issue with ZIP codes, make sure the keys are always the same datatype (str), maybe even ASCII.
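
For instance, a variant of the sort function above that coerces keys on insertion (a sketch; forcing str() is an assumption about the data):

def sort(d, output):
    for k, v in d.items():
        if isinstance(v, dict):
            sort(v, output)
        else:
            # coerce keys to str so 10001 and "10001" land in the same list
            output[str(k)].append(v)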

dnalow
  • I appreciate your answer; this is definitely a simpler way of doing what I was doing, but it achieves the exact same results. I checked the ZIP codes and they are always strings, but I'm still getting duplicates. Also, this may complicate things a little, but I want to be able to make each file a row in a dataset, so that every value from the same file can be matched up. Do you have any advice on how I can address this? I edited what I would like my end result to look like into the main post. – Matthew Kaplan Aug 05 '20 at 16:52
  • @MatthewKaplan, what exactly do you mean by duplicates? Do you mean that some lists are longer than others? This can of course happen because you have several dictionaries in one file... – dnalow Aug 06 '20 at 14:05
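
Addressing the row-per-file request from the comment above, a sketch that builds one row per file (the flatten helper is hypothetical, and it assumes each key appears at most once per file; the paths are the placeholders from the question):

import json
import pandas as pd


def flatten(d, flat):
    # copy every leaf value of a nested dict into a flat dict;
    # a key repeated at different nesting levels would be overwritten
    for k, v in d.items():
        if isinstance(v, dict):
            flatten(v, flat)
        else:
            flat[k] = v
    return flat


df = pd.read_csv("/Users/---.csv")
rows = []
for i in df['file_num']:
    with open("/Users/--/allDAFs{}.json".format(i)) as f:
        data = json.load(f)
    rows.append(flatten(data, {'file_num': i}))

table = pd.DataFrame(rows)  # one row per file, one column per key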

I used pd.json_normalize() like Richie suggested, which worked well.
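
For reference, a minimal sketch of that approach (the paths and the file_num column come from the question; pd.json_normalize flattens one nested dict into a one-row DataFrame):

import json
import pandas as pd

df = pd.read_csv("/Users/---.csv")
rows = []
for i in df['file_num']:
    with open("/Users/--/allDAFs{}.json".format(i)) as f:
        data = json.load(f)
    row = pd.json_normalize(data)  # nested keys become dotted column names
    row['file_num'] = i            # keep the link to the source file
    rows.append(row)

result = pd.concat(rows, ignore_index=True)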