finding duplicates in a list of lists

Question

I am using Python 2.7 and am trying to de-duplicate a list of lists and merge the values of the duplicates.

Right now I have:

original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]

I want to match on the first element of each nested list and then add the values of the second element. I want to end up with this (the order of the final list does not matter):

ideal_output = [['a', 2], ['b', 7], ['c', 2]]

So far I have some code that will find me the duplicate values based on the first element of each nested list:

for item in original_list:
    matches = -1
    for x in original_list:
        if (item[0] == x[0]):
            matches += 1
    if matches >= 1: 
        if item[0] not in duplicates_list:
            duplicates_list.append(item[0])

From here I need to search for all duplicates_list items that are in original_list and add up the values, but I am not sure what the best way to do that is.

Note that the most answers so far don't maintain the order of keys, is it important? Would this be ok: `[['b', 7],['a', 2],['c', 2]]`? — georg, Nov 06 '13 at 12:07
Hi, no the order does not matter. Sorry, I should have mentioned that. I'll edit the question. — e h, Nov 06 '13 at 12:09
possible duplicate of [Removing the lists from a list which are duplicated for some items](http://stackoverflow.com/questions/16878062/removing-the-lists-from-a-list-which-are-duplicated-for-some-items) — beroe, Nov 06 '13 at 12:20
@beroe thanks for pointing that question out. I did not see it when searching before. It is similar, but deals with a different matching logic and does not require any summing of values. — e h, Nov 06 '13 at 13:36

Mark Reed · Accepted Answer · 2022-02-21T15:50:56.720

33

Lots of good answers, but they all use rather more code than I would for this, so here's my take, for what it's worth:

totals = {}
for k,v in original_list:
  totals[k] = totals.get(k,0) + v

# totals = {'a': 2, 'c': 2, 'b': 7}

Once you have a dict like that, from any of these answers, you can use items to get a(n object that acts like a) list of tuples:

totals.items()
# => dict_items([('a', 2), ('c', 2), ('b', 7)])

And run list across the tuples to get a list of lists:

[list(t) for t in totals.items()]
# => [['a', 2], ['c', 2], ['b', 7]]

And sort if you want them in order:

sorted([list(t) for t in totals.items()])
# => [['a', 2], ['b', 7], ['c', 2]]

edited Feb 21 '22 at 15:50

answered Nov 06 '13 at 12:02

Mark Reed

91,912
16
138
175

He wants the result as a list. – thefourtheye Nov 06 '13 at 12:05
get your +1, for simplicity, althrough amount of code is negligibly the same, except converting to list and setting up test case – alko Nov 06 '13 at 12:06
3

Thanks, this does exactly what I was trying to do and explaining the code this way helped me understand it quicker. – e h Nov 06 '13 at 12:17
2

could use `totals = collections.defaultdict(int)` and then simply have `totals[k] += v` – Neil G Nov 07 '13 at 18:02
@arshajii Nothing is inherently wrong with it, but (as shown in the various answers to this question) people _love_ to use it when they shouldn't, when it would just make the code more complicated than it needs to be. – Izkata Nov 08 '13 at 17:59

Maciej Gol · Answer 2 · 2013-11-06T12:12:26.233

15

>>> from collections import Counter
>>> lst = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
>>> c = Counter(x for x, c in lst for _ in xrange(c))

Counter({'b': 7, 'a': 2, 'c': 2})

>>> map(list, c.iteritems())
[['a', 2], ['c', 2], ['b', 7]]

Or alternatively, without repeating each item (a, b) b times (@hcwhsa):

>>> from collections import Counter
>>> lst = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
>>> c = sum((Counter(**{k:v}) for k, v in lst), Counter())

Counter({'b': 7, 'a': 2, 'c': 2})

>>> map(list, c.iteritems())
[['a', 2], ['c', 2], ['b', 7]]

edited Nov 06 '13 at 12:12

answered Nov 06 '13 at 11:57

Maciej Gol

15,394
4
33
51

4

Or : `sum((Counter(**{k:v}) for k, v in lst), Counter())` – Ashwini Chaudhary Nov 06 '13 at 12:05
Note that if the list contains some key like `['a', 10000]` then instead of just simple summing this solution will iterate 10000 times, which is inefficient and in that case alko's solution is better. – Ashwini Chaudhary Nov 06 '13 at 12:08
2

@hcwhsa I think we dont need `**` – thefourtheye Nov 06 '13 at 12:11
@kroolik look for performance tests (in my answer), we are outpaced :) – alko Nov 06 '13 at 21:03
@alko, perhaps because we are using `Counter`, which is implemented entirely in Python, compared to other solutions ([see repo](http://hg.python.org/releasing/2.7.6/file/ba31940588b6/Lib/collections.py#l387)). If performance matters that much, I suggest writting a module in C to do that properly :P – Maciej Gol Nov 06 '13 at 21:14

alko · Answer 3 · 2013-11-06T19:51:13.290

SOLUTION

Use collections.Counter:

from collections import Counter
original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
result = Counter()
for k, v in original_list:
     result.update({k:v})

map(list, result.items())
# [['a', 2], ['c', 2], ['b', 7]]

FINDINGS

So, lot of answers, views and upvotes. I even earned my first Nice answer out of nothing (in last 2 days I made lot of answers worth of more research and efforts). In view of this, I decided to do at least some research and test solutions performance with a simple script written from scratch. Do not include code directly in answer for the sake of size.

Each function is named for it's author an easily can be found in question. thefourtheye's solution now equals one of Mark Reed and is evaluated in original form, thefourtheye2 states for itertools.groupby based solution.

Each was tested several times (samples), each sample in turn invoked several function iterations. I evaluated min, max and standard deviation for samples times.

Here we go, running probing test for 10 times.

testing: thefourtheye, kroolik2, void, kroolik, alko, reed, visser
   10 samples
   10 iterations each
         author   min     avg     max    stddev
           reed 0.00000 0.00000 0.00000 0.00000
         visser 0.00000 0.00150 0.01500 0.00450
   thefourtheye 0.00000 0.00160 0.01600 0.00480
  thefourtheye2 0.00000 0.00310 0.01600 0.00620
           alko 0.00000 0.00630 0.01600 0.00772
           void 0.01500 0.01540 0.01600 0.00049
       kroolik2 0.04700 0.06430 0.07800 0.00831
        kroolik 0.32800 0.34380 0.37500 0.01716

Look at bottom two rows: at this point kroolik solutions were disqualified since with it any reasonable amount of samples*iterations will be performed for hours. Here goes final tests. I manually added number of upvotes to ouptut:

testing: thefourtheye, kroolik2, void, kroolik, alko, reed, visser
   100 samples
  1000 iterations each
         author  upvotes   min     avg     max    stddev
           reed  [20]    0.06200 0.08174 0.15600 0.01841
   thefourtheye   [5]    0.06200 0.09971 0.20300 0.01911
         visser   [6]    0.10900 0.12392 0.23500 0.02263
  thefourtheye2          0.25000 0.29674 0.89000 0.07183
           alko  [11]    0.56200 0.62309 1.04700 0.08438
           void   [3]    1.50000 1.65480 2.39100 0.18721
        kroolik  [14]     [DSQ]

how comes this answer gets +2 just the moment I posted it? even loosing to kroolik by some seconds :) — alko, Nov 06 '13 at 12:00
Why are you doing `result.update({k:v})`? SIt's equivalent to `result[k] += v` but with higher overhead (creation of a `dict` plus a method call with lookup are avoided using `+=`). — Bakuriu, Nov 06 '13 at 13:46

thefourtheye · Answer 4 · 2013-11-06T12:16:59.323

10

If the order doesnt matter, you can use this

original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
myDict = {}
for first, second in original_list:
    myDict[first] = myDict.get(first, 0) + second
result = [[key, value] for key, value in myDict.items()]
print result

Or you can use groupby and the code becomes a oneliner

original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
from itertools import groupby
print [[key, sum(item[1] for item in list(group))]
       for key, group in groupby(sorted(original_list), lambda x:x[0])]

Output

[['a', 2], ['b', 7], ['c', 2]]

edited Nov 06 '13 at 12:16

answered Nov 06 '13 at 11:55

thefourtheye

233,700
52
457
497

with collections.defaultdict you get get rid of setdefault() – lucasg Nov 06 '13 at 11:57
1

As georgesl mentioned you should use `defaultdict(int)` or `dict.get`( if you want to use a normal `dict`). I think `dict.setdefault` should be used when you're assigning a mutable default value. Anyways +1. – Ashwini Chaudhary Nov 06 '13 at 12:15

score 5 · Answer 5 · answered Nov 06 '13 at 11:56

5

You can use collections.defaultdict:

original_list = [['a', 1], ['b', 1], ['a', 1], ['b', 1], ['b', 2], ['c', 2], ['b', 3]]
import collections
data = collections.defaultdict(list)
for item in original_list:
    data[item[0]].append(item[1])

output = {key: sum(values) for key, values in data.items()}
print output
# gives: {'a': 2, 'c': 2, 'b': 7}

answered Nov 06 '13 at 11:56

Simeon Visser

118,920
18
185
180

It's more efficient to use `defaultdict(int)` – Neil G Nov 07 '13 at 18:06

score 5 · Answer 6 · 2013-11-06T12:51:45.840

5

I know it's ugly but I was having some fun trying to implement it in a 1 liner:

map(list, set(([(x[0], sum([i[1] for i in original_list if i[0]==x[0]])) for x in original_list])))

output:

[['a', 2], ['b', 7], ['c', 2]]

edited Nov 06 '13 at 12:51

answered Nov 06 '13 at 12:32

score 4 · Answer 7 · edited Nov 06 '13 at 16:27

4

May be you can try this too,

>>> x = [[1,1],[2,2],[1,1],[2,2],[3,3],[4,4],[4,4]]
>>> z = []
>>> for i in x:
>>>    if i not in z:
>>>        z.append(i)
>>>
>>> z
[[1, 1], [2, 2], [3, 3], [4, 4]]

edited Nov 06 '13 at 16:27

orlp

112,504
36
218
315

answered Nov 06 '13 at 14:24

rajpython

179
2
3
14

finding duplicates in a list of lists

7 Answers7

SOLUTION

FINDINGS

Linked