2

I am currently parsing two JSON files using Python 2.7. The goal is to check each JSON object in file1 to each JSON object on file2 and compare them using their 'name' keys. If there is a match then overwrite obj2 with obj1 data. My psuedocode (below) right now would run in O(n^4) time. That is way too slow so if anyone can point out a faster method I'd appreciate it.

for obj1 in file1:
   for key1, value1 in obj1.iteritems():
       if key1 == 'name':
           for obj2 in file2:
               for key2, value2 in obj2.iteritems():
                   if key2 == 'name':
                       if value1 == value2:
                           overwrite obj2 using obj1 data
sbru
  • 877
  • 6
  • 21
  • 36

2 Answers2

4

Store your objects from file1 in a dictionary, keyed by name:

file1_names = {}
for obj1 in file1:
    if 'name' not in obj1:
        continue
    file1_names.setdefault(obj1['name'], []).append(obj1)

Now you can look up these objects in O(1) time now:

for obj2 in file2:
    if 'name' not in obj2:
        continue
    for obj1 in file1_names.get(obj2['name'], []):
        obj2.update(obj1)

The above scans through file1 and file2 just once each, making the overall timecomplexity O(N) where N is the total number of objects in the two files.

I've made the following assumptions:

  • Names in obj1 are not unique, so they are collected per name into lists.
  • The 'name' key could be missing.

If these assumptions don't hold (so names are unique and always given), you can simplify the above to:

file1_names = {o['name']: o for o in file1}
for obj2 in file2:
    obj2.update(file1_names.get(obj2['name'], {}))
Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • This is exactly what I was looking for, Thanks @Martijn! Can you explain how this is O(1) though? Isn't it O(2n) -> O(n) because you have to iterate through each file once? – sbru Sep 12 '14 at 18:45
  • Could you explain what is going on in the bottom block of code? I'm not familiar with a lot of that syntax. – sbru Sep 12 '14 at 18:54
  • 1
    @bagelboy: you mean the `{key_expr: value_expr for target in iterable}` syntax? That's a dictionary comprehension; it builds a dictionary just like a list comprehension builds a list. Except there are two expressions per loop iteration, one for the key and another for the value. – Martijn Pieters Sep 12 '14 at 18:56
  • 1
    @bagelboy: see [Python: create a dictionary with list comprehension](http://stackoverflow.com/q/1747817) – Martijn Pieters Sep 12 '14 at 18:56
  • I see, makes sense. In your example, you use 'o' for both the key, value, and target. How is that allowed? – sbru Sep 12 '14 at 19:05
  • @bagelboy: The target is the loop variable, just like `for o in file1:` in a regular loop. The other two `o` references then use that loop target to create a key (`o['name']`) and a value (`o` itself)`. That's as if you did `file1_names[o['name']] = o` in a `for` loop. – Martijn Pieters Sep 12 '14 at 19:13
  • I see, but doesn't target variable 'o' contain a JSON object that depends on the iteration? If so, I understand you select `o['name']` for the key, but then how can you use 'o' for the value if 'o' is a whole JSON object? – sbru Sep 12 '14 at 19:21
  • @bagelboy: why would that be a problem? Inside the loop `o` is a reference to one of the dictionaries, and we just store a reference to the same in the resulting dictionary object as a value. – Martijn Pieters Sep 12 '14 at 20:58
  • So your making a new dictionary where the keys are the values of the 'name' key in each old object and the value for these keys in the new dictionary is a reference to the old object? Like originally you have this object `{'k1':'v1', 'k2','v2', 'name':'v3}` then when you make the new dictionary it will contain: `{'name':'{'k1':'v1', 'k2','v2', 'name':'v3}'}` – sbru Sep 12 '14 at 21:07
  • @bagelboy: no, you have dictionaries in a *list*; we are now making a dictionary containing those same dictionaries from the list. So you have `[{'foo': 'bar', 'name': 'ham'}, {'baz': 'bam', 'name': 'spam'}]` and we are transforming that to `{'ham': {'foo': 'bar', 'name': 'ham'}, 'spam': {'baz': 'bam', 'name': 'spam'}}`. `o` is bound to each dictionary from the input list in turn. – Martijn Pieters Sep 12 '14 at 21:12
  • Oh oops, that's what I meant, 'v3' instead of 'names' for the key of the new dict. Looking at this code though, this would only work if each JSON object was on its own single line in the file. – sbru Sep 12 '14 at 21:16
  • @bagelboy: your *original code* makes that assumption. Nowhere do you load JSON objects from a file; all you presented us with was a list of already-decoded objects. – Martijn Pieters Sep 12 '14 at 21:18
  • 1
    @bagelboy: loading JSON objects from a file either line by line or otherwise is a different problem; see [Loading & Parsing JSON file in python](http://stackoverflow.com/a/12451465) and [How do I use the json module to read in one json object at a time?](http://stackoverflow.com/a/21709058) – Martijn Pieters Sep 12 '14 at 21:19
  • Ahh, very useful resources however i just solved the problem by using: `json.load(file1)` – sbru Sep 12 '14 at 21:49
  • @bagelboy: Right, but then you have just the *one* JSON object, a list with dictionaries presumably. – Martijn Pieters Sep 12 '14 at 21:50
  • It's just a dictionary composed of keys as 'name' and values as dictionaries relating to that 'name' – sbru Sep 12 '14 at 21:53
1

How big is your files? Is there any concerns loading them into memory? I will do something like following pseudocodes:

I am assuming obj1, obj2 are dictionaries since you are using iteritems.

dict1 = dict( (o['name'], o) for o in file1 )
dict2 = dict( (o['name'], o) for o in file2 )
dict2.update(dict1)
  • You are updating data incorrectly. Dict1 data should overwrite dict2 data, not the other way around, and you should probably not replace the nested dictionaries wholesale. – Martijn Pieters Sep 12 '14 at 18:47
  • 1
    There is also a typo in your code; the parentheses are not balanced in the second line. And you don't need to build a list first; removing the `[...]` in both lines would make the code more efficient. – Martijn Pieters Sep 12 '14 at 18:55
  • Fix the typo. And thanks for pointing out [...]. I was not aware you can do list comprehension with dict() directly. – Roger Jiang Sep 12 '14 at 19:55
  • It's called a generator expression. Note that you now replace items in `dict2` entirely; if any of the `o` objects in `dict2` had *more* keys than the corresponding objects in `dict1`, then those keys are now lost. – Martijn Pieters Sep 12 '14 at 20:35