48

Given n lists with m dictionaries as their elements, I would like to produce a new list, with a joined set of dictionaries. Each dictionary is guaranteed to have a key called "index", but could have an arbitrary set of keys beyond that. The non-index keys will never overlap across lists. For example, imagine the following two lists:

l1 = [{"index":1, "b":2}, {"index":2, "b":3}, {"index":3, "green":"eggs"}]
l2 = [{"index":1, "c":4}, {"index":2, "c":5}]

("b" would never appear in l2, since it appeared in l1, and similarly, "c" would never appear in l1, since it appeared in l2)

I would like to produce a joined list:

l3 = [{"index":1, "b":2, "c":4}, 
      {"index":2, "b":3, "c":5}, 
      {"index":3, "green":"eggs"}]

What is the most efficient way to do this in Python?

agiledatabase
  • 29
  • 1
  • 4
Bacon
  • 2,155
  • 4
  • 23
  • 31
  • Is it guaranteed that the value of the "index" entry in the dict will match the position of that dict in the list? – Dave Costa Mar 31 '11 at 15:04
  • Nope - It is not guaranteed that "index" would match the position of the dict in the list. – Bacon Mar 31 '11 at 15:08

3 Answers3

58
from collections import defaultdict

l1 = [{"index":1, "b":2}, {"index":2, "b":3}, {"index":3, "green":"eggs"}]
l2 = [{"index":1, "c":4}, {"index":2, "c":5}]

d = defaultdict(dict)
for l in (l1, l2):
    for elem in l:
        d[elem['index']].update(elem)
l3 = d.values()

# l3 is now:

[{'b': 2, 'c': 4, 'index': 1},
 {'b': 3, 'c': 5, 'index': 2},
 {'green': 'eggs', 'index': 3}]

EDIT: Since l3 is not guaranteed to be sorted (.values() returns items in no specific order), you can do as @user560833 suggests:

from operator import itemgetter

...

l3 = sorted(d.values(), key=itemgetter("index"))
eumiro
  • 207,213
  • 34
  • 299
  • 261
  • Was about to implement something very close to this. – Andrea Spadaccini Mar 31 '11 at 15:00
  • 2
    You will need to sort l3 afterwards - it is not guaranteed that the list will be in order of the index. e.g. `from operator import itemgetter; l3.sort(key=itemgetter("index"))` – Dave Kirby Mar 31 '11 at 16:36
  • 1
    Instead off `for l in (l1, l2): for elem in l:` it is better to use directly `for elem in itertools.chain(l1, l2)` – klis87 Jun 07 '16 at 17:36
  • This is quite efficient and readable, and as a bonus doesn't require any invocation of zip functions (which tend to get hairy quickly IME). Compared to a dict comprehension with a `next` generator to fetch the corresponding key, this was several orders of magnitude faster. – bsplosion May 11 '21 at 21:39
18

In python 3.5 or higher, you can merge dictionaries in a single statement.

So for python 3.5 or higher, a quick solution would be:

from itertools import zip_longest

l3 = [{**u, **v} for u, v in zip_longest(l1, l2, fillvalue={})]

print(l3)
#[
#    {'index': 1, 'b': 2, 'c': 4}, 
#    {'index': 2, 'b': 3, 'c': 5}, 
#    {'index': 3, 'green': 'eggs'}
#]

However if the two lists were the same size, you could simply use zip:

l3 = [{**u, **v} for u, v in zip(l1, l2)]

Note: This assumes that the lists are sorted the same way by index, which is stated by OP to not be the case in general.

In order to generalize for that case, one way is to create a custom zip-longest type function which yields values from the two lists only if they match on a key.

For instance:

def sortedZipLongest(l1, l2, key, fillvalue={}):  
    l1 = iter(sorted(l1, key=lambda x: x[key]))
    l2 = iter(sorted(l2, key=lambda x: x[key]))
    u = next(l1, None)
    v = next(l2, None)

    while (u is not None) or (v is not None):  
        if u is None:
            yield fillvalue, v
            v = next(l2, None)
        elif v is None:
            yield u, fillvalue
            u = next(l1, None)
        elif u.get(key) == v.get(key):
            yield u, v
            u = next(l1, None)
            v = next(l2, None)
        elif u.get(key) < v.get(key):
            yield u, fillvalue
            u = next(l1, None)
        else:
            yield fillvalue, v
            v = next(l2, None)

Now if you had the following out of order lists:

l1 = [{"index":1, "b":2}, {"index":2, "b":3}, {"index":3, "green":"eggs"}, 
      {"index":4, "b": 4}]
l2 = [{"index":1, "c":4}, {"index":2, "c":5}, {"index":0, "green": "ham"}, 
      {"index":4, "green": "ham"}]

Using the sortedZipLongest function instead of itertools.zip_longest:

l3 = [{**u, **v} for u, v in sortedZipLongest(l1, l2, key="index", fillvalue={})]
print(l3)
#[{'index': 0, 'green': 'ham'},
# {'index': 1, 'b': 2, 'c': 4},
# {'index': 2, 'b': 3, 'c': 5},
# {'index': 3, 'green': 'eggs'},
# {'index': 4, 'b': 4, 'green': 'ham'}]

Whereas original method would produce the incorrect answer:

l3 = [{**u, **v} for u, v in zip_longest(l1, l2, fillvalue={})]
print(l3)
#[{'index': 1, 'b': 2, 'c': 4},
# {'index': 2, 'b': 3, 'c': 5},
# {'index': 0, 'green': 'ham'},
# {'index': 4, 'b': 4, 'green': 'ham'}]
pault
  • 41,343
  • 15
  • 107
  • 149
  • 1
    Excellent solution for `sortedZipLongest`. I think you can simplify it by replacing all the `None` values with `fillvalue` or simply `{}` since `fillvalue` has to be a dictionary to be unpacked later on using `**`. – pmsoltani Jan 17 '21 at 14:32
1

Here's a one-liner that does this:

[dict(sum([z.items() for z in z2],[])) for z2 in [[x3 for x3 in l1+l2 if x3['index']==key] for key in set([x1['index'] for x1 in l1]+[x2['index'] for x2 in l2])]]

Not quite as elegant as a list-comprehension. I don't think the result is guaranteed to necessarily be sorted the way you want either.

Expanding the one-liner:

[
    dict(sum([z.items() for z in z2],[])) 
    for z2 in [
        [
            x3 for x3 in l1+l2 if x3['index']==key
        ] for key in set(
            [x1['index'] for x1 in l1]+[x2['index'] for x2 in l2]
        )
    ]
]

The set expression on the 6th line gets all the unique index values from both lists. The list comprehension around that (lines 3-9) creates a list of lists where each inner list is a combined list of dictionaries for that index/key with a particular index value. The outermost list comprehension creates a single list of tuple-pairs for each key and converts it back to a list of dictionaries.

Mark
  • 19
  • 1
  • Using `sum` to flatten lists is inefficient- you should use `itertools.chain` instead: [why sum on lists is (sometimes) faster than itertools.chain?](https://stackoverflow.com/a/41772165/5858851) – pault Jul 19 '19 at 18:58