173

I have a list of lists in Python:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

And I want to remove duplicate elements from it. Was if it a normal list not of lists I could used set. But unfortunate that list is not hashable and can't make set of lists. Only of tuples. So I can turn all lists to tuples then use set and back to lists. But this isn't fast.

How can this done in the most efficient way?

The result of above list should be:

k = [[5, 6, 2], [1, 2], [3], [4]]

I don't care about preserve order.

Note: this question is similar but not quite what I need. Searched SO but didn't find exact duplicate.


Benchmarking:

import itertools, time


class Timer(object):
    def __init__(self, name=None):
        self.name = name

    def __enter__(self):
        self.tstart = time.time()

    def __exit__(self, type, value, traceback):
        if self.name:
            print '[%s]' % self.name,
        print 'Elapsed: %s' % (time.time() - self.tstart)


k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]] * 5
N = 100000

print len(k)

with Timer('set'):
    for i in xrange(N):
        kt = [tuple(i) for i in k]
        skt = set(kt)
        kk = [list(i) for i in skt]


with Timer('sort'):
    for i in xrange(N):
        ks = sorted(k)
        dedup = [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]


with Timer('groupby'):
    for i in xrange(N):
        k = sorted(k)
        dedup = list(k for k, _ in itertools.groupby(k))

with Timer('loop in'):
    for i in xrange(N):
        new_k = []
        for elem in k:
            if elem not in new_k:
                new_k.append(elem)

"loop in" (quadratic method) fastest of all for short lists. For long lists it's faster then everyone except groupby method. Does this make sense?

For short list (the one in the code), 100000 iterations:

[set] Elapsed: 1.3900001049
[sort] Elapsed: 0.891000032425
[groupby] Elapsed: 0.780999898911
[loop in] Elapsed: 0.578000068665

For longer list (the one in the code duplicated 5 times):

[set] Elapsed: 3.68700003624
[sort] Elapsed: 3.43799996376
[groupby] Elapsed: 1.03099989891
[loop in] Elapsed: 1.85900020599
martineau
  • 119,623
  • 25
  • 170
  • 301
zaharpopov
  • 16,882
  • 23
  • 75
  • 93
  • 1
    By "this isn't fast", do you mean that you have timed it and it is not fast enough for your application, or that you think it is not fast? – Torsten Marek Feb 06 '10 at 17:23
  • @Torsten, it just seems like too much copying to be smart method. sorry, gut feeling. copy lists to tuples, then into set, then back to list of lists (copy again tuples to lists) – zaharpopov Feb 06 '10 at 17:28
  • @zaharpopov: thats not how Python works, nothing will be *copied*, just new containers for the existing elements ( though for ints, it's pretty much the same ) – Jochen Ritzel Feb 06 '10 at 17:31
  • 3
    1. the timings for the methods using sorting are deflated, because "k" is rebound to the sorted variant. 2. The last method is faster because the way you generate the test data leaves you with at most 4 distinct elements. Try sth. like K = [[int(u) for u in str(random.randrange(1, 1000))] for _ in range(100)] – Torsten Marek Feb 06 '10 at 18:16
  • @Torsten: fixed thanks. but still the loop method is fast even when there is only one duplicate in list of 10 – zaharpopov Feb 06 '10 at 18:25
  • @zaharpopov: Yes, for small input lists, "loop" will be faster. Its complexity is O(mn), where m is the number of unique elements (of n). So the less unique elements there are, the faster it is (i.e. linear for lists with only one unique element). For short lists, the constant factors of both "sort" and "groupby" are too high for them to be faster than "loop". – Torsten Marek Feb 06 '10 at 18:33
  • @zaharpopov, to measure performance in one specific case, use standard library module `timeit` _at a shell prompt_ (simplest and soundest, much better than embedding it in code). I'll edit my A to show this. – Alex Martelli Feb 06 '10 at 18:55
  • @zaharpopov: as Torsten says, the loop is fast with small samples. That's quite normal in algorithms that incur in small constant costs (creating an empty list vd. creating a list from a tuple). But it still grows quadratically, meaning that the linear methods will be faster than the loop at some point. – Ricardo Cárdenes Feb 06 '10 at 19:36
  • I'd just like to point out that you're only testing on what I would call "short" lists. A "long" list for me would have perhaps 10k or 100k elements, perhaps a few million. – drevicko Jun 19 '18 at 07:15

17 Answers17

239
>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]

itertools often offers the fastest and most powerful solutions to this kind of problems, and is well worth getting intimately familiar with!-)

Edit: as I mention in a comment, normal optimization efforts are focused on large inputs (the big-O approach) because it's so much easier that it offers good returns on efforts. But sometimes (essentially for "tragically crucial bottlenecks" in deep inner loops of code that's pushing the boundaries of performance limits) one may need to go into much more detail, providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th centile is more important than an average or median, depending on one's apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.

Careful measurements of "point" performance (code A vs code B for a specific input) are a part of this extremely costly process, and standard library module timeit helps here. However, it's easier to use it at a shell prompt. For example, here's a short module to showcase the general approach for this problem, save it as nodup.py:

import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
  return map(list, set(map(tuple, k)))

def dosort(k, sorted=sorted, xrange=xrange, len=len):
  ks = sorted(k)
  return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby, list=list):
  ks = sorted(k)
  return [i for i, _ in itertools.groupby(ks)]

def donewk(k):
  newk = []
  for i in k:
    if i not in newk:
      newk.append(i)
  return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == '__main__':
  savek = list(k)
  for f in doset, dosort, dogroupby, donewk:
    resk = f(k)
    assert k == savek
    print '%10s %s' % (f.__name__, sorted(resk))

Note the sanity check (performed when you just do python nodup.py) and the basic hoisting technique (make constant global names local to each function for speed) to put things on equal footing.

Now we can run checks on the tiny example list:

$ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
100000 loops, best of 3: 4.44 usec per loop

confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:

$ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
10000 loops, best of 3: 25 usec per loop

the quadratic approach isn't bad, but the sort and groupby ones are better. Etc, etc.

If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it's worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).

It's also well worth considering keeping a different representation for k -- why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program's performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed, might be faster overall, for example.

Alex Martelli
  • 854,459
  • 170
  • 1,222
  • 1,395
  • 1
    @alex thanks for alternative. this method about same speed as danben's, a few % faster – zaharpopov Feb 06 '10 at 17:52
  • @alex: strangely this is slower than a naive quadratic method for shorter lists (see question edit) – zaharpopov Feb 06 '10 at 18:01
  • @zaharpopov: it's that way only in your special case, cf. my comment to the question. – Torsten Marek Feb 06 '10 at 18:19
  • @zaharpopov, if you give a probability distribution of list and sublist lengths and chance of duplicates, it's possible (with huge effort) to compute/measure the runtimes probability distribution for any given code and optimize whatever measure you need (median, mean, 90th centile, whatever). It's hardly ever done because of very low ROI: normally one focuses on the much-easier case of large inputs (the big-O approach), where inferior algorithms would really hurt performance terribly. And I don't see you specify any probability distributions in your Q anyway;-). – Alex Martelli Feb 06 '10 at 18:53
  • 1
    Are you using `k` as the list's name and `k` as the iterator variable? – emmagras Apr 28 '15 at 17:04
  • @emmagras, no, why do you ask? – Alex Martelli Apr 30 '15 at 22:46
  • 1
    In `list(k for k,_ in itertools.groupby(k))`, I'm pretty sure the `k` in `list(k for k...` is different from the `k` in `grouby(k)`, and if I were more sure than that'd I'd suggest renaming the list. My question above might not be using the right vocabulary – emmagras May 01 '15 at 14:17
  • 1
    I am also confused by the k, how does that work? Or are they different and could I also write: `list(x for x,_ in itertools.groupby(k))` ? – Sebastian Oct 12 '16 at 10:47
  • @AlexMartelli this works for my issue, however, I'm completely bamboozled by the 'k,_'. This is a list comprehension? Can you explain or point me to literature please? – spacedustpi Aug 23 '18 at 20:17
  • **caveat**: the `k.sort()` function will remove orders in this list. I did not realize this and had to spend hours debugging, as in my user case orders matter. To solve it, before sorting you can use `copy.copy()` to make a deep copy of the list! – yuqli Apr 26 '19 at 19:09
  • @yuqli And how do i get the unique list with the preserved order when I have this copy? – xuiqzy Oct 01 '20 at 12:50
  • thus, if `len(list(k for k,_ in itertools.groupby(k))) == len(list(k for k,_ in itertools.groupby(k)))` there are no duplicates – seralouk Mar 17 '21 at 18:08
  • @yuqli, copy.copy(x) Returns a shallow copy of x. Use copy.deepcopy(x) to Return a deep copy of x. – Joe Sep 02 '23 at 11:33
35

Doing it manually, creating a new k list and adding entries not found so far:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
new_k = []
for elem in k:
    if elem not in new_k:
        new_k.append(elem)
k = new_k
print k
# prints [[1, 2], [4], [5, 6, 2], [3]]

Simple to comprehend, and you preserve the order of the first occurrence of each element should that be useful, but I guess it's quadratic in complexity as you're searching the whole of new_k for each element.

Paul Stephenson
  • 67,682
  • 9
  • 49
  • 51
  • @paul: very strange - this method is faster than all others – zaharpopov Feb 06 '10 at 17:55
  • 1
    I suspect this method won't be faster for very long lists. It'll depend on your application: if you really just have six-element lists with two duplicates then any solution is likely to be fast enough and you should go with the clearest code. – Paul Stephenson Feb 06 '10 at 18:26
  • 1
    @zaharpopov, It's not quadratic in your benchmark because you duplicate the same list over and over. You are benchmarking with a linear corner case. – Mike Graham Feb 06 '10 at 19:02
  • `k = ([[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [6], [8], [9]] +[[x] for x in range(1000)]) *5` will show up the quadratic behaviour nicely – John La Rooy Feb 06 '10 at 21:27
22
>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> k = sorted(k)
>>> k
[[1, 2], [1, 2], [3], [4], [4], [5, 6, 2]]
>>> dedup = [k[i] for i in range(len(k)) if i == 0 or k[i] != k[i-1]]
>>> dedup
[[1, 2], [3], [4], [5, 6, 2]]

I don't know if it's necessarily faster, but you don't have to use to tuples and sets.

danben
  • 80,905
  • 18
  • 123
  • 145
  • Thank you danben. this faster than turn to tuples then 'set' then back to lists? – zaharpopov Feb 06 '10 at 17:24
  • You could easily test that - write both deduping methods, generate some random lists using `random`, and time it with `time`. – danben Feb 06 '10 at 17:26
10

List of tuple and {} can be used to remove duplicates

>>> [list(tupl) for tupl in {tuple(item) for item in k }]
[[1, 2], [5, 6, 2], [3], [4]]
>>> 
SuperNova
  • 25,512
  • 7
  • 93
  • 64
9
a_list = [
          [1,2],
          [1,2],
          [2,3],
          [3,4]
]

print (list(map(list,set(map(tuple,a_list)))))

outputs: [[1, 2], [3, 4], [2, 3]]

Boris Mocialov
  • 3,439
  • 2
  • 28
  • 55
7

Even your "long" list is pretty short. Also, did you choose them to match the actual data? Performance will vary with what these data actually look like. For example, you have a short list repeated over and over to make a longer list. This means that the quadratic solution is linear in your benchmarks, but not in reality.

For actually-large lists, the set code is your best bet—it's linear (although space-hungry). The sort and groupby methods are O(n log n) and the loop in method is obviously quadratic, so you know how these will scale as n gets really big. If this is the real size of the data you are analyzing, then who cares? It's tiny.

Incidentally, I'm seeing a noticeable speedup if I don't form an intermediate list to make the set, that is to say if I replace

kt = [tuple(i) for i in k]
skt = set(kt)

with

skt = set(tuple(i) for i in k)

The real solution may depend on more information: Are you sure that a list of lists is really the representation you need?

Mike Graham
  • 73,987
  • 14
  • 101
  • 130
6

All the set-related solutions to this problem thus far require creating an entire set before iteration.

It is possible to make this lazy, and at the same time preserve order, by iterating the list of lists and adding to a "seen" set. Then only yield a list if it is not found in this tracker set.

This unique_everseen recipe is available in the itertools docs. It's also available in the 3rd party toolz library:

from toolz import unique

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

# lazy iterator
res = map(list, unique(map(tuple, k)))

print(list(res))

[[1, 2], [4], [5, 6, 2], [3]]

Note that tuple conversion is necessary because lists are not hashable.

jpp
  • 159,742
  • 34
  • 281
  • 339
2

Create a dictionary with tuple as the key, and print the keys.

  • create dictionary with tuple as key and index as value
  • print list of keys of dictionary

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

dict_tuple = {tuple(item): index for index, item in enumerate(k)}

print [list(itm) for itm in dict_tuple.keys()]

# prints [[1, 2], [5, 6, 2], [3], [4]]
SuperNova
  • 25,512
  • 7
  • 93
  • 64
  • `list(map(list, dict.fromkeys(map(tuple, k))))` does the same. But you should probably split it out to make it readable :D `dict_of_unique_tuples = dict.fromkeys(map(tuple, k))` and `list(map(list, dict_of_unique_tuples))` – xuiqzy Oct 01 '20 at 13:01
1

Strangely, the answers above removes the 'duplicates' but what if I want to remove the duplicated value also?? The following should be useful and does not create a new object in memory!

def dictRemoveDuplicates(self):
    a=[[1,'somevalue1'],[1,'somevalue2'],[2,'somevalue1'],[3,'somevalue4'],[5,'somevalue5'],[5,'somevalue1'],[5,'somevalue1'],[5,'somevalue8'],[6,'somevalue9'],[6,'somevalue0'],[6,'somevalue1'],[7,'somevalue7']]


print(a)
temp = 0
position = -1
for pageNo, item in a:
    position+=1
    if pageNo != temp:
        temp = pageNo
        continue
    else:
        a[position] = 0
        a[position - 1] = 0
a = [x for x in a if x != 0]         
print(a)

and the o/p is:

[[1, 'somevalue1'], [1, 'somevalue2'], [2, 'somevalue1'], [3, 'somevalue4'], [5, 'somevalue5'], [5, 'somevalue1'], [5, 'somevalue1'], [5, 'somevalue8'], [6, 'somevalue9'], [6, 'somevalue0'], [6, 'somevalue1'], [7, 'somevalue7']]
[[2, 'somevalue1'], [3, 'somevalue4'], [7, 'somevalue7']]
zorze
  • 198
  • 1
  • 6
1

The simplest solution is to convert a list of lists into a list of tuples and then apply dict.fromkeys() method then convert it back to the list.

for example:

you have k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

Convert to list of tuples k = list(map(tuple, k))

This will give you [(1, 2), (4,), (5, 6, 2), (1, 2), (3,), (4,)]

Then do the following: unique = list(dict.fromkeys(k))

You will have [(1, 2), (4,), (5, 6, 2), (3,)]

That's all.

Okroshiashvili
  • 3,677
  • 2
  • 26
  • 40
0

This should work.

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

k_cleaned = []
for ele in k:
    if set(ele) not in [set(x) for x in k_cleaned]:
        k_cleaned.append(ele)
print(k_cleaned)

# output: [[1, 2], [4], [5, 6, 2], [3]]
Zoe L
  • 1,150
  • 14
  • 22
0

A bit of a background, I just started with python and learnt comprehensions.

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
dedup = [elem.split('.') for elem in set(['.'.join(str(int_elem) for int_elem in _list) for _list in k])]
gourab ghosh
  • 159
  • 6
0

If the complaint is not with 'not fast' per se but with 'not concise enough' part of your proposed solution, then in Python 3.5+ with help of unpacking operator and concise tuple notation you can make chained data structure conversions extremely brief (granted, this is still O(n^2), but still unpacking is slightly faster than direct conversion):

Input:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
k = [*map(list, {*map(tuple, k)})]

# If you prefer comprehensions to map()
# k = [[*t] for t in {(*l,) for l in k}]

# Order-preserving alternative:
# k = [*map(list, dict.fromkeys(map(tuple, k)))]

print(k)

Output:

[[1, 2], [4], [5, 6, 2], [3]]
Somebody Out There
  • 347
  • 1
  • 5
  • 15
0

If you want to keep the order of elements intact

You can use dict.fromkeys() from Python 3.7 onwards order will not change:

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

[list(x) for x in dict.fromkeys(tuple(x) for x in k)]

#[[1, 2], [4], [5, 6, 2], [3]]

If you don't care about the order of elements then:

[list(x) for x in set(tuple(x) for x in k)]

#[[5, 6, 2], [1, 2], [3], [4]]
Talha Tayyab
  • 8,111
  • 25
  • 27
  • 44
-1

Another probably more generic and simpler solution is to create a dictionary keyed by the string version of the objects and getting the values() at the end:

>>> dict([(unicode(a),a) for a in [["A", "A"], ["A", "A"], ["A", "B"]]]).values()
[['A', 'B'], ['A', 'A']]

The catch is that this only works for objects whose string representation is a good-enough unique key (which is true for most native objects).

jacmkno
  • 1,189
  • 11
  • 25
-1
k=[[1, 2], [4], [5, 6, 2], [1, 2], [3], [5, 2], [3], [8], [9]]
kl=[]
kl.extend(x for x in k if x not in kl)
k=list(kl)
print(k)

which prints,

[[1, 2], [4], [5, 6, 2], [3], [5, 2], [8], [9]]
-1

NumPy provides a unique() function that implements this operation already:

import numpy as np

unique_list = list(np.unique(list_with_duplicates))
joanis
  • 10,635
  • 14
  • 30
  • 40
  • 2
    Welcome to Stack Overflow! While this code may answer the question, providing additional context regarding *why* and/or *how* this code answers the question improves its long-term value. – dan1st Aug 07 '23 at 22:19
  • 1
    Avoid code only answer and provide an explanation. – Itération 122442 Aug 09 '23 at 06:37