Merge two (or more) lists of dictionaries pairing using a specific key

Question

I have a set of different lists of dictionaries (actually obtained reading Excel worksheets) and I need to do an "inner join" on them:

each list is equivalent to a database table (each dict is a record)
each record has a specific key guaranteed unique in the list (column is "index")
I need to produce another list of dictionaries where each dictionary has a given "index" and all other key/value found in all lists where "index" match

To exemplify:

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

and I want yo have:

c = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

of course problem is various list may have different length and not be sorted nicely.

Is there an easy way to do this or should I do nested loops explicitly searching for the matching record?

This actually works, but I'm VERY unsure it's the "most pythonic way":

a = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

c = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

li = [a, b]
t = [{z['idx']: z for z in w} for w in li]
r = {}
for k in t:
    for j in k:
        if j in r:
            r[j].update(k[j])
        else:
            r[j] = k[j]
r = [t for t in r.values()]

print(r)

[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'}, {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'}, {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}, {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

Can someone come up with something better?

Perhaps this answers your question: https://stackoverflow.com/a/5501893/10226703 — sebtheiler, Jul 12 '22 at 12:47
What about keys that only appear in one of `a` and `b`, but not both? — Scott Hunter, Jul 12 '22 at 12:50
@ScottHunter: I should have a complete set of indexes, each with the "fields" it has. I updated OP to reflect. — ZioByte, Jul 12 '22 at 13:10
I don't understand the `'fie': 'xx0'` in row 1 of `c` and `'fie': 'xx1'` in row 2 of `c`: Could you explain them? — Timus, Jul 12 '22 at 14:04
@Timus You mean is it a typo, right? I'm thinking r is result, and c is expected, so it should be c[0]['fie'] == 'zz0' ? — Kenny Ostrom, Jul 12 '22 at 14:10
@KennyOstrom Yes, exaxtly (yes: typo; yes: your interpretation) — Timus, Jul 12 '22 at 14:13
It looks like this is basically done the same as I would have done it, using idx as a dict key. It just needs better variable names, maybe defaultdict, maybe itertools.chain. My version is prettier but basically identical. Because it works and there are no bugs or issues to resolve, this probably should be in the codereview forum https://codereview.stackexchange.com/ — Kenny Ostrom, Jul 12 '22 at 14:20
@Timus: sorry, my bad. I corrected the typo and added the actual output of my code. Question remains: is this (about) the "mot pythonic" way? — ZioByte, Jul 12 '22 at 14:23

Kenny Ostrom · Answer 1 · 2022-07-12T17:20:05.670

This is basically the same as your code, as far as the algorithm. You had the right idea using O(1) dict lookup, and update to merge the dicts.

from itertools import chain
from collections import defaultdict
from pprint import pprint

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

KEY = 'idx'
merged = defaultdict(dict)
for row in chain(a, b):
    merged[row[KEY]].update(row)

pprint(list(merged.values()))

I tried not to use any single letter variable names (besides the original inputs)
itertools.chain lets you iterate over several iterables as one
defaultdict hides some of that "if it's in there already, do this, otherwise do that"
[x for x in iterable] could be written list(iterable)
The "merged" data structure is more useful. It's a shame to dump it out to an inefficient list, but that was the requirement.

If possible, you could return merged.values(), which is an iterable view object https://docs.python.org/3.7/library/stdtypes.html?highlight=dict%20values#dictionary-view-objects

Concerns:
This could be handled in a database or pandas, which are designed for this exact function.
What if the rows happen to have a conflict on one of the data fields? You'll never know, as update will just overwrite.

That's way better than my attempt, nice. – Timus Jul 12 '22 at 17:06 — Timus, Jul 12 '22 at 17:06

score 1 · Accepted Answer · answered Jul 12 '22 at 14:25

1

I'm not sure if this is more efficient than your solution:

from operator import itemgetter
from itertools import chain, groupby

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

c = sorted(a + b, key=itemgetter('idx'))
c = [
    dict(chain(*(record.items() for record in group)))
    for _, group in groupby(c, key=itemgetter('idx'))
]

Result:

[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
 {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
 {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
 {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

answered Jul 12 '22 at 14:25

Timus

10,974
5
14
28

sorting is less efficient than simple iteration, and you need the sort for groupby to work – Kenny Ostrom Jul 12 '22 at 14:30
@KennyOstrom Yes, that's what I suspect too. – Timus Jul 12 '22 at 14:45
1

It surely is less understandable for me, bu that's an excellent reason ti dig deeper into `itertools` an friends. Fr this reason (and fact it actually works) I'll accept your answer ;) – ZioByte Jul 12 '22 at 14:46

Guy · Answer 3 · 2022-07-12T14:38:30.213

If you are using Python 3.9 you can use the union operator or update() in older versions (added a third shorter list to the example)

a = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0'}, {'idx': 1, 'foo': 'xx1', 'bar': 'yy1'}, {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'}, {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}, {'idx': 2, 'fie': 'zz2', 'fom': 'kk2'}]
c = [{'idx': 0, 'ief': 'zz0', 'mof': 'kk0'}, {'idx': 1, 'ief': 'zz1', 'mof': 'kk1'}]

lists = [b, c]

# with union
for lst in lists:
    for i, d in enumerate(lst):
        a[i] = a[i] | d

# with update
for lst in lists:
    for i, d in enumerate(lst):
        a[i].update(d)

print(a)

Edit:

If the dictionaries are not sorted or don't have the same keys you can sort the during the merge and add the missing keys

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

a.sort(key=lambda x: x['idx'])
lists = [b, c]
for lst in lists:
    lst.sort(key=lambda x: x['idx'])
    for i, d in enumerate(lst):
        if d['idx'] == a[i]['idx']:
            a[i] = a[i] | d
        else:
            a.append(d)
print(a)

Output

[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'xx0', 'fom': 'kk0'},
 {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'xx1', 'fom': 'kk1'},
 {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
 {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

I am unsure this will work if some key is missing in some dictionary but present in others. Can you have a look to updated OP? — ZioByte, Jul 12 '22 at 13:46

Soudipta Dutta · Answer 4 · 2022-12-14T07:47:01.390

from collections import defaultdict
from operator import itemgetter
l1 =[{'id': 1, 'City': 'Calcutta'}, {'id': 3, 'Country': 'Germany'}]
l2 = [{'id': 1, 'Country': 'India'}, {'id': 2, 'City': 'Delhi'}, {'id': 3, 'City': 'Berlin'}]
    
def merge1(l1,l2):
    d = defaultdict(dict)
    for l in (l1, l2):
        for innerdict1 in l:
            d[innerdict1['id']].update(innerdict1)
    
    l4 = sorted(d.values(), key=itemgetter("id"))
    l4p = print(l4)
    return l4p
merge1(l1, l2)
    
"""
[{'id': 1, 'City': 'Delhi', 'Country': 'India'}, {'id': 2, 'City': 'Calcutta'}, {'id': 3, 'Country': 'Germany', 'City': 'Berlin'}]
    
"""
"""
This second one contains a key(denoted by A), common to all. A slight difference in structure. Please take a look.
"""

l1 =[{'A' :{'id': 1, 'City': 'Calcutta'}}, {'A' : {'id': 3, 'Country': 'Germany'}}]
l2 = [{'A' : {'id': 1, 'Country': 'India'}}, {'A':{'id': 2, 'City': 'Delhi'}}, {'A':{'id': 3, 'City': 'Berlin'}}]


def merge2(l1,l2):
    d= defaultdict(dict)
    for l in (l1,l2):
        for innerdict in l :
            d[innerdict['A']['id']].update(innerdict['A'])
    l3 = sorted(d.values(), key = itemgetter('id'))
    l3p = print(l3)
    return l3p
merge2(l1, l2)

Merge two (or more) lists of dictionaries pairing using a specific key

4 Answers4