3

I have a set of different lists of dictionaries (actually obtained reading Excel worksheets) and I need to do an "inner join" on them:

  • each list is equivalent to a database table (each dict is a record)
  • each record has a specific key guaranteed unique in the list (column is "index")
  • I need to produce another list of dictionaries where each dictionary has a given "index" and all other key/value found in all lists where "index" match

To exemplify:

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

and I want yo have:

c = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

of course problem is various list may have different length and not be sorted nicely.

Is there an easy way to do this or should I do nested loops explicitly searching for the matching record?

This actually works, but I'm VERY unsure it's the "most pythonic way":

a = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

c = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

li = [a, b]
t = [{z['idx']: z for z in w} for w in li]
r = {}
for k in t:
    for j in k:
        if j in r:
            r[j].update(k[j])
        else:
            r[j] = k[j]
r = [t for t in r.values()]

print(r)
[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'}, {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'}, {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}, {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]

Can someone come up with something better?

ZioByte
  • 2,690
  • 1
  • 32
  • 68
  • 1
    Input is not a correct data structure – Epsi95 Jul 12 '22 at 12:45
  • 1
    You forgot to post your attempt at solving this problem. – Scott Hunter Jul 12 '22 at 12:46
  • Perhaps this answers your question: https://stackoverflow.com/a/5501893/10226703 – sebtheiler Jul 12 '22 at 12:47
  • 1
    What about keys that only appear in one of `a` and `b`, but not both? – Scott Hunter Jul 12 '22 at 12:50
  • @ScottHunter: I should have a complete set of indexes, each with the "fields" it has. I updated OP to reflect. – ZioByte Jul 12 '22 at 13:10
  • I don't understand the `'fie': 'xx0'` in row 1 of `c` and `'fie': 'xx1'` in row 2 of `c`: Could you explain them? – Timus Jul 12 '22 at 14:04
  • @Timus You mean is it a typo, right? I'm thinking r is result, and c is expected, so it should be c[0]['fie'] == 'zz0' ? – Kenny Ostrom Jul 12 '22 at 14:10
  • @KennyOstrom Yes, exaxtly (yes: typo; yes: your interpretation) – Timus Jul 12 '22 at 14:13
  • It looks like this is basically done the same as I would have done it, using idx as a dict key. It just needs better variable names, maybe defaultdict, maybe itertools.chain. My version is prettier but basically identical. Because it works and there are no bugs or issues to resolve, this probably should be in the codereview forum https://codereview.stackexchange.com/ – Kenny Ostrom Jul 12 '22 at 14:20
  • @Timus: sorry, my bad. I corrected the typo and added the actual output of my code. Question remains: is this (about) the "mot pythonic" way? – ZioByte Jul 12 '22 at 14:23

4 Answers4

2

This is basically the same as your code, as far as the algorithm. You had the right idea using O(1) dict lookup, and update to merge the dicts.

from itertools import chain
from collections import defaultdict
from pprint import pprint

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

KEY = 'idx'
merged = defaultdict(dict)
for row in chain(a, b):
    merged[row[KEY]].update(row)

pprint(list(merged.values()))

I tried not to use any single letter variable names (besides the original inputs)
itertools.chain lets you iterate over several iterables as one
defaultdict hides some of that "if it's in there already, do this, otherwise do that"
[x for x in iterable] could be written list(iterable)
The "merged" data structure is more useful. It's a shame to dump it out to an inefficient list, but that was the requirement.

If possible, you could return merged.values(), which is an iterable view object https://docs.python.org/3.7/library/stdtypes.html?highlight=dict%20values#dictionary-view-objects

Concerns:
This could be handled in a database or pandas, which are designed for this exact function.
What if the rows happen to have a conflict on one of the data fields? You'll never know, as update will just overwrite.

Kenny Ostrom
  • 5,639
  • 2
  • 21
  • 30
1

I'm not sure if this is more efficient than your solution:

from operator import itemgetter
from itertools import chain, groupby

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

c = sorted(a + b, key=itemgetter('idx'))
c = [
    dict(chain(*(record.items() for record in group)))
    for _, group in groupby(c, key=itemgetter('idx'))
]

Result:

[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'zz0', 'fom': 'kk0'},
 {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'zz1', 'fom': 'kk1'},
 {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
 {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]
Timus
  • 10,974
  • 5
  • 14
  • 28
  • sorting is less efficient than simple iteration, and you need the sort for groupby to work – Kenny Ostrom Jul 12 '22 at 14:30
  • @KennyOstrom Yes, that's what I suspect too. – Timus Jul 12 '22 at 14:45
  • 1
    It surely is less understandable for me, bu that's an excellent reason ti dig deeper into `itertools` an friends. Fr this reason (and fact it actually works) I'll accept your answer ;) – ZioByte Jul 12 '22 at 14:46
0

If you are using Python 3.9 you can use the union operator or update() in older versions (added a third shorter list to the example)

a = [{'idx': 0, 'foo': 'xx0', 'bar': 'yy0'}, {'idx': 1, 'foo': 'xx1', 'bar': 'yy1'}, {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'}, {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}, {'idx': 2, 'fie': 'zz2', 'fom': 'kk2'}]
c = [{'idx': 0, 'ief': 'zz0', 'mof': 'kk0'}, {'idx': 1, 'ief': 'zz1', 'mof': 'kk1'}]

lists = [b, c]

# with union
for lst in lists:
    for i, d in enumerate(lst):
        a[i] = a[i] | d

# with update
for lst in lists:
    for i, d in enumerate(lst):
        a[i].update(d)

print(a)

Edit:

If the dictionaries are not sorted or don't have the same keys you can sort the during the merge and add the missing keys

a = [{'idx': 1, 'foo': 'xx1', 'bar': 'yy1'},
     {'idx': 0, 'foo': 'xx0', 'bar': 'yy0'},
     {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'}]
b = [{'idx': 0, 'fie': 'zz0', 'fom': 'kk0'},
     {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'},
     {'idx': 1, 'fie': 'zz1', 'fom': 'kk1'}]

a.sort(key=lambda x: x['idx'])
lists = [b, c]
for lst in lists:
    lst.sort(key=lambda x: x['idx'])
    for i, d in enumerate(lst):
        if d['idx'] == a[i]['idx']:
            a[i] = a[i] | d
        else:
            a.append(d)
print(a)

Output

[{'idx': 0, 'foo': 'xx0', 'bar': 'yy0', 'fie': 'xx0', 'fom': 'kk0'},
 {'idx': 1, 'foo': 'xx1', 'bar': 'yy1', 'fie': 'xx1', 'fom': 'kk1'},
 {'idx': 2, 'foo': 'xx2', 'bar': 'yy2'},
 {'idx': 3, 'fie': 'zz3', 'fom': 'kk3'}]
Guy
  • 46,488
  • 10
  • 44
  • 88
0
from collections import defaultdict
from operator import itemgetter
l1 =[{'id': 1, 'City': 'Calcutta'}, {'id': 3, 'Country': 'Germany'}]
l2 = [{'id': 1, 'Country': 'India'}, {'id': 2, 'City': 'Delhi'}, {'id': 3, 'City': 'Berlin'}]
    
def merge1(l1,l2):
    d = defaultdict(dict)
    for l in (l1, l2):
        for innerdict1 in l:
            d[innerdict1['id']].update(innerdict1)
    
    l4 = sorted(d.values(), key=itemgetter("id"))
    l4p = print(l4)
    return l4p
merge1(l1, l2)
    
"""
[{'id': 1, 'City': 'Delhi', 'Country': 'India'}, {'id': 2, 'City': 'Calcutta'}, {'id': 3, 'Country': 'Germany', 'City': 'Berlin'}]
    
"""
"""
This second one contains a key(denoted by A), common to all. A slight difference in structure. Please take a look.
"""

l1 =[{'A' :{'id': 1, 'City': 'Calcutta'}}, {'A' : {'id': 3, 'Country': 'Germany'}}]
l2 = [{'A' : {'id': 1, 'Country': 'India'}}, {'A':{'id': 2, 'City': 'Delhi'}}, {'A':{'id': 3, 'City': 'Berlin'}}]


def merge2(l1,l2):
    d= defaultdict(dict)
    for l in (l1,l2):
        for innerdict in l :
            d[innerdict['A']['id']].update(innerdict['A'])
    l3 = sorted(d.values(), key = itemgetter('id'))
    l3p = print(l3)
    return l3p
merge2(l1, l2)

        
            
Soudipta Dutta
  • 1,353
  • 1
  • 12
  • 7