
I have a sorted list of dictionaries like so:

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

This is effectively (id1, id2, value) tuples, but where there are duplicates. I would like to deduplicate these by summing the values where both ids are equal, leaving me with unique (id1, id2) pairs where the new value is the sum of the dupes.

That is, from above, the desired output is:

dat =[
     {'id1': 1, 'id2': 2, 'value': 3},
     {'id1': 2, 'id2': 2, 'value': 2},
     {'id1': 2, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 3, 'value': 1},
     {'id1': 3, 'id2': 4, 'value': 4}
     ]

Assume the list is millions with lots of duplicates. What's the most efficient way to do this using itertools or funcy (versus say, using pandas)?

Mittenchops

3 Answers


You can start with collections.Counter and use the += operator; the convenient part of Counter is that += assumes zero for missing keys.

dat = [
      {"id1": 1, "id2": 2, "value": 1},
      {"id1": 1, "id2": 2, "value": 2},
      {"id1": 2, "id2": 2, "value": 2},
      {"id1": 2, "id2": 3, "value": 1},
      {"id1": 3, "id2": 3, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      {"id1": 3, "id2": 4, "value": 1},
      ]

from collections import Counter

cnt = Counter()
for item in dat:
    cnt[item["id1"], item["id2"]] += item["value"]

[{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in cnt.items()]

Giving

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]
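
The loop works without pre-initializing any keys because of that default-to-zero lookup; a minimal illustration:

from collections import Counter

c = Counter()
c[("a", "b")] += 3    # no KeyError: the missing key is treated as 0, so this stores 3
print(c[("a", "b")])  # 3
print(c[("x", "y")])  # 0 -- plain lookups of missing keys also return 0 (without adding the key)
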
Bob

We could use collections.defaultdict as well:

from collections import defaultdict
tmp = defaultdict(int)
for d in dat:
    tmp[d['id1'], d['id2']] += d['value']
out = [{'id1': id1, 'id2': id2, 'value': v} for (id1, id2), v in tmp.items()]

or (assuming the ids are sorted), itertools.groupby:

from itertools import groupby
out = [{'id1': k1, 'id2': k2, 'value': sum(d['value'] for d in g)}
       for (k1, k2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]

or groupby + sum + to_dict in pandas:

import pandas as pd

out = pd.DataFrame(dat).groupby(['id1', 'id2'], as_index=False)['value'].sum().to_dict('records')

Output:

[{'id1': 1, 'id2': 2, 'value': 3},
 {'id1': 2, 'id2': 2, 'value': 2},
 {'id1': 2, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 3, 'value': 1},
 {'id1': 3, 'id2': 4, 'value': 4}]
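
Since the question also mentions funcy, a similar grouping can be sketched with its group_by helper (an untested sketch, assuming funcy's group_by(f, seq), which returns a mapping of key -> list of items):

from funcy import group_by  # third-party: pip install funcy

# Group the dicts by their (id1, id2) pair, then sum the values within each group
grouped = group_by(lambda d: (d['id1'], d['id2']), dat)
out = [{'id1': id1, 'id2': id2, 'value': sum(d['value'] for d in g)}
       for (id1, id2), g in grouped.items()]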


A basic benchmark on the provided data shows that groupby using itemgetter (as suggested by @ShadowRanger) is the fastest:

  1. defaultdict: 6.57 µs ± 491 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  2. Counter (Bob): 9.56 µs ± 1.47 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  3. groupby + itemgetter (ShadowRanger): 6.01 µs ± 182 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  4. groupby + lambda: 9.02 µs ± 598 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
  5. pandas: 3.81 ms ± 68.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Now if we duplicate dat one million times and re-sort it, i.e. do

from operator import itemgetter

dat = dat * 1_000_000
dat.sort(key=itemgetter('id1', 'id2'))

and run the same benchmark again, groupby with itemgetter is the runaway winner:

  1. 3.91 s ± 320 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  2. 5.38 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  3. 1.77 s ± 128 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  4. 3.53 s ± 199 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
  5. 15.2 s ± 831 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

These benchmarks were run on Python 3.9.7 (64-bit).

This benchmark somewhat favors groupby, since there are very few distinct groups when we duplicate an existing small list of dicts. If we randomize the group sizes instead, groupby + itemgetter still beats the others, but the difference is not as stark.
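
For reference, a timing comparison along these lines could be reproduced with a small timeit harness (a sketch; via_defaultdict and via_groupby_itemgetter are hypothetical wrappers around the snippets above, and the sample data here is made up):

import timeit
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def via_defaultdict(rows):
    tmp = defaultdict(int)
    for d in rows:
        tmp[d['id1'], d['id2']] += d['value']
    return [{'id1': i1, 'id2': i2, 'value': v} for (i1, i2), v in tmp.items()]

def via_groupby_itemgetter(rows):
    get_ids = itemgetter('id1', 'id2')
    return [{'id1': i1, 'id2': i2, 'value': sum(d['value'] for d in g)}
            for (i1, i2), g in groupby(rows, key=get_ids)]

# Small sample, duplicated and sorted so groupby's "already grouped" precondition holds
sample = [{'id1': 1, 'id2': 2, 'value': 1}, {'id1': 1, 'id2': 2, 'value': 2},
          {'id1': 2, 'id2': 3, 'value': 1}] * 10_000
sample.sort(key=itemgetter('id1', 'id2'))

for fn in (via_defaultdict, via_groupby_itemgetter):
    secs = timeit.timeit(lambda: fn(sample), number=10)
    print(f"{fn.__name__}: {secs / 10:.4f} s per call")
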

    Note: You can mildly improve the `itertools.groupby` solution to unpack the key the same way you do for the listcomp in your `defaultdict` code, e.g. `out = [{'id1': id1, 'id2': id2, 'value': sum(d['value'] for d in g)} for (id1, id2), g in groupby(dat, lambda x: (x['id1'], x['id2']))]`. I personally like using `operator.itemgetter` here over `lambda`s and manual lookups, but admittedly, in the most modern Python releases, the performance advantage to using pre-built `itemgetter`s over `lambda` has mostly evaporated (though I still prefer them for clarity). – ShadowRanger Mar 02 '22 at 21:01
    Also, a minor side-note for future readers: `Counter` will win for cases where you're counting a large iterable of "things to count" (in CPython at least, where it has an accelerator specifically to aid that use case), vs. this case where it's summing existing counts (and loses). But when it's being used as a glorified `defaultdict(int)` with no `Counter` specific features, yeah, `Counter` loses (aside from the accelerator it's largely implemented in Python level code, where `defaultdict(int)` is purely C-level on the CPython reference interpreter). – ShadowRanger Mar 02 '22 at 21:04
  • @ShadowRanger thanks, it's very informative. And `groupby` with `itemgetter` indeed has much better performance than the other options. –  Mar 02 '22 at 22:36
  • @enke: For posterity, can you include the version of Python you ran the benchmarks under (including 32 vs. 64 bittedness). I'm surprised `groupby`+`itemgetter` did so well relative to `groupby`+`lambda` because last I'd checked they'd optimized `lambda`s so much (using the new vectorcall protocol, which `itemgetter` isn't using yet) that `itemgetter` doesn't win by as much (and sometimes loses). I don't recall exactly when the changes occurred though (somewhere in the 3.7-3.10 timeframe, possibly incrementally). – ShadowRanger Mar 03 '22 at 00:31

Just for fun, a purely itertools solution (no collections, and no other intermediate containers that must be built and updated incrementally) if the list is already in key order, though it requires a pre-sort if you can't guarantee it's already sorted so that unique id pairs are grouped together:

# At top of file
from itertools import groupby

# Also at top of file; not strictly necessary, but I find it's nicer to make cheap getters
# with self-documenting names
from operator import itemgetter
get_ids = itemgetter('id1', 'id2')
get_value = itemgetter('value')

# On each use:
dat.sort(key=get_ids)  # Not needed if data guaranteed grouped by unique id1/id2 pairs as in example

dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(dat, key=get_ids)]

# If sorting is needed, you can optionally one-line it, though the result is rather dense (I don't recommend it):
dat = [{'id1': id1, 'id2': id2, 'value': sum(map(get_value, group))}
       for (id1, id2), group in groupby(sorted(dat, key=get_ids), key=get_ids)]

Personally, I'd generally use Counter or defaultdict(int) as shown in the other answers, since they get O(n) performance even with unsorted data (groupby itself is O(n), but if you need to sort first, the sort is O(n log n)). Basically, the only time this approach even has a theoretical advantage is when the data is already sorted and you value a one-liner (excluding imports and the one-time setup cost of the itemgetters); in practice, itertools.groupby has enough overhead that it still typically loses to one or both of collections.Counter and collections.defaultdict(int), especially when Counter is used in its optimized mode for counting iterables of things to count (which doesn't apply here, but is worth knowing about).
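
To make that last point concrete: Counter's fast path is for counting occurrences of the items it is fed (the case the comments above describe as C-accelerated in CPython), and here that would only count how many rows share each id pair rather than sum their values. A minimal illustration of the distinction:

from collections import Counter

rows = [{"id1": 1, "id2": 2, "value": 1}, {"id1": 1, "id2": 2, "value": 2}]

# Fast path: counting occurrences of keys fed directly to the constructor
occurrences = Counter((d["id1"], d["id2"]) for d in rows)
print(occurrences)  # Counter({(1, 2): 2}) -- two rows share that id pair

# What this question needs: summing the 'value' field per key, one += at a time
sums = Counter()
for d in rows:
    sums[d["id1"], d["id2"]] += d["value"]
print(sums)  # Counter({(1, 2): 3})
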

ShadowRanger