
I'm trying to merge logs from several servers. Each log is a list of tuples (date, count). date may appear more than once, and I want the resulting dictionary to hold the sum of all counts from all servers.

Here's my attempt, with some data for example:

from collections import defaultdict

a = [("13.5", 100)]
b = [("14.5", 100), ("15.5", 100)]
c = [("15.5", 100), ("16.5", 100)]
input = [a, b, c]

output = defaultdict(int)
for d in input:
    for item in d:
        output[item[0]] += item[1]
print dict(output)

Which gives:

{'14.5': 100, '16.5': 100, '13.5': 100, '15.5': 200}

As expected.

I'm about to go bananas because of a colleague who saw the code. She insists that there must be a more Pythonic and elegant way to do it, without these nested for loops. Any ideas?

sloth
Adam Matan

4 Answers


Doesn't get simpler than this, I think:

a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input=[a,b,c]

from collections import Counter

print sum(
    (Counter(dict(x)) for x in input),
    Counter())

Note that Counter (also known as a multiset) is the most natural data structure for your data: a type of set to which elements can belong more than once, or equivalently, a map with the semantics Element -> OccurrenceCount. You could have used it in the first place, instead of lists of tuples.
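For instance, if each server's log were kept as a Counter from the start, the merge collapses to a plain sum (a sketch using the question's sample data):

```python
from collections import Counter

# Keep each server's log as a Counter instead of a list of tuples
a = Counter({"13.5": 100})
b = Counter({"14.5": 100, "15.5": 100})
c = Counter({"15.5": 100, "16.5": 100})

# Counter addition sums the counts of duplicate keys
merged = a + b + c
print(merged["15.5"])  # 200
```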


Also possible:

from collections import Counter
from operator import add

print reduce(add, (Counter(dict(x)) for x in input))

Using reduce(add, seq) instead of sum(seq, initialValue) is generally more flexible and allows you to skip passing the redundant initial value.

Note that you could also use operator.and_ to find the intersection of the multisets instead of the sum.
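A small sketch of that intersection variant on two of the sample logs (one count tweaked here to show the minimum being taken):

```python
from collections import Counter
from operator import and_

b = Counter({"14.5": 100, "15.5": 100})
c = Counter({"15.5": 150, "16.5": 100})

# and_(b, c) is the same as b & c: it keeps only the keys present
# in both multisets, each with the minimum of the two counts
common = and_(b, c)
print(common)  # Counter({'15.5': 100})
```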


The above variant is terribly slow, because a new Counter is created on every step. Let's fix that.

We know that Counter+Counter returns a new Counter with merged data. This is OK, but we want to avoid extra creation. Let's use Counter.update instead:

update(self, iterable=None, **kwds) unbound collections.Counter method

Like dict.update() but add counts instead of replacing them. Source can be an iterable, a dictionary, or another Counter instance.
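A quick sketch of the difference between the two update methods:

```python
from collections import Counter

plain = {"15.5": 100}
plain.update({"15.5": 100})    # dict.update replaces the value
print(plain["15.5"])           # 100

counted = Counter({"15.5": 100})
counted.update({"15.5": 100})  # Counter.update adds the counts
print(counted["15.5"])         # 200
```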

That's what we want. Let's wrap it with a function compatible with reduce and see what happens.

def updateInPlace(a,b):
    a.update(b)
    return a

print reduce(updateInPlace, (Counter(dict(x)) for x in input))

This is only marginally slower than the OP's solution.

Benchmark: http://ideone.com/7IzSx (Updated with yet another solution, thanks to astynax)

(Also: If you desperately want a one-liner, you can replace updateInPlace by lambda x,y: x.update(y) or x, which works the same way and even proves to be a split second faster, but fails at readability. Don't :-))

Kos
  • What about time complexity? Is it more efficient than OP's code? – jerrymouse Jul 02 '12 at 09:11
  • I don't think so. The OP's code doesn't create any intermediate objects, so it should generally be more efficient. – Kos Jul 02 '12 at 09:16
  • This approach is indeed the slowest of what was presented here, with OP's original solution being the fastest. No surprises here. http://ideone.com/HAmvi – Kos Jul 02 '12 at 09:37
  • Fixed with a more efficient approach. Benchmark: http://ideone.com/2eGQi – Kos Jul 02 '12 at 09:56
  • +1 That's truly elegant. Thanks for presenting this data structure; it fits the problem like a glove. Plus, there is not enough data to care about performance. – Adam Matan Jul 02 '12 at 10:04
  • 1
    @Kos, i slightly changed the [benchmark](http://ideone.com/7IzSx). Fastest way: `defaultdict` + `chain` (and my rewrited `merge_with` in the middle) :) – Aleksei astynax Pirogov Jul 02 '12 at 10:42
from collections import Counter


a = [("13.5",100)]
b = [("14.5",100), ("15.5", 100)]
c = [("15.5",100), ("16.5", 100)]

inp = [dict(x) for x in (a,b,c)]
count = Counter()
for y in inp:
  count += Counter(y)
print(count)

output:

Counter({'15.5': 200, '14.5': 100, '16.5': 100, '13.5': 100})

Edit: As Duncan suggested, you can replace these 3 lines:

count = Counter()
for y in inp:
    count += Counter(y)

with the single line:

count = sum((Counter(y) for y in inp), Counter())

Ashwini Chaudhary

You could use itertools' groupby:

from itertools import groupby, chain

a=[("13.5",100)]
b=[("14.5",100), ("15.5", 100)]
c=[("15.5",100), ("16.5", 100)]
input = sorted(chain(a,b,c), key=lambda x: x[0])

output = {}
for k, g in groupby(input, key=lambda x: x[0]):
  output[k] = sum(x[1] for x in g)

print output

The use of groupby instead of two loops and a defaultdict will make your code clearer.
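One caveat (also raised in the comments below): groupby only merges consecutive equal keys, so the sorted() call is essential. A quick sketch with the sample data, chaining the logs in a different order:

```python
from itertools import groupby, chain

a = [("13.5", 100)]
b = [("14.5", 100), ("15.5", 100)]
c = [("15.5", 100), ("16.5", 100)]

# Without sorting, the two "15.5" entries are not adjacent
# in chain(a, c, b), so groupby yields two separate groups
unsorted_keys = [k for k, g in groupby(chain(a, c, b), key=lambda x: x[0])]
sorted_keys = [k for k, g in groupby(sorted(chain(a, c, b)), key=lambda x: x[0])]

print(unsorted_keys)  # ['13.5', '15.5', '16.5', '14.5', '15.5']
print(sorted_keys)    # ['13.5', '14.5', '15.5', '16.5']
```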

sloth
  • Instead of the lambda, you can also drop in an `operator.itemgetter(0)` :) – Kos Jul 02 '12 at 08:43
  • 1
    Wrong: `groupby`, as said in the doc you mention, needs sorting first ! Here this works because `b[1]` and `c[0]` will be consecutive in `chain(a,b,c)` but if you do `chain(a,c,b)` instead, the result is not correct (you get 100 instead of 200 for `output['15.5']`)... – Emmanuel Jul 02 '12 at 08:49
  • 1
    I guess its personal taste, but I find this harder to read than defaultdict, and also its slower than the OP approach – fraxel Jul 02 '12 at 08:49
  • You're welcome, good to have thought about `groupby` anyway! – Emmanuel Jul 02 '12 at 08:56
  • @Emmanuel Yeah, it just seemed natural to use it. But also I think there's nothing wrong with the code the questioner uses. As fraxel said: a matter of personal taste in the end. – sloth Jul 02 '12 at 09:01
  • BTW This is sub-optimal because it needs to have all the input sequences in memory at once **and** needs to sort the whole chain of them. – Kos Jul 02 '12 at 09:02
  • Hah, profiling ftw. I said sub-optimal, but still faster than mine. :) – Kos Jul 02 '12 at 09:29

You can use Counter or defaultdict, or you can try my variant:

def merge_with(d1, d2, fn=lambda x, y: x + y):
    res = d1.copy() # "= dict(d1)" for lists of tuples
    for key, val in d2.items(): # ".. in d2" for lists of tuples
        try:
            res[key] = fn(res[key], val)
        except KeyError:
            res[key] = val
    return res

>>> merge_with({'a':1, 'b':2}, {'a':3, 'c':4})
{'a': 4, 'c': 4, 'b': 2}

Or even more generic:

def make_merger(fappend=lambda x, y: x + y, fempty=lambda x: x):
    def inner(*dicts):
        res = dict((k, fempty(v)) for k, v
            in dicts[0].items()) # ".. in dicts[0]" for lists of tuples
        for dic in dicts[1:]:
            for key, val in dic.items(): # ".. in dic" for lists of tuples
                try:
                    res[key] = fappend(res[key], val)
                except KeyError:
                    res[key] = fempty(val)
        return res
    return inner

>>> make_merger()({'a':1, 'b':2}, {'a':3, 'c':4})
{'a': 4, 'c': 4, 'b': 2}

>>> appender = make_merger(lambda x, y: x + [y], lambda x: [x])
>>> appender({'a':1, 'b':2}, {'a':3, 'c':4}, {'b':'BBB', 'c':'CCC'})
{'a': [1, 3], 'c': [4, 'CCC'], 'b': [2, 'BBB']}

Also, you could subclass dict and implement an __add__ method.
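A minimal sketch of that subclassing idea (the class name AddableDict is invented here for illustration):

```python
class AddableDict(dict):
    """A dict whose + operator merges keys and sums values of shared keys."""
    def __add__(self, other):
        res = AddableDict(self)  # copy, so neither operand is mutated
        for key, val in other.items():
            res[key] = res[key] + val if key in res else val
        return res

d = AddableDict({"a": 1, "b": 2}) + AddableDict({"a": 3, "c": 4})
print(d)  # keys 'a', 'b', 'c' map to 4, 2, 4
```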

Caridorc