How to remove duplicates from a list of tuples but keeping the original order

Question

I want to remove redundant tuples but preserve the order of appearance. I looked at similar questions. This question Find unique rows in numpy.array looked very promising but somehow it did not work for me.

I could use pandas as in this answer (https://stackoverflow.com/a/14089586/566035) but I prefer not using pandas so that the py2exe generated executable will be small.

import numpy as np

data = [('a','z'), ('a','z'), ('a','z'), ('1','z'), ('e','z'), ('c','z')]

#What I want is:
    array([['a', 'z'],
           ['1', 'z'],
           ['e', 'z'],
           ['c', 'z']], 
          dtype='|S1')

#What I have tried:
# (1) numpy.unique, order not preserved
np.unique(data)

    array([['a', 'z'],
           ['c', 'z'],
           ['1', 'z'],
           ['e', 'z']], 
          dtype='|S1')

# (2) python set, order not preserved
set(data)

    set([('1', 'z'), ('a', 'z'), ('c', 'z'), ('e', 'z')])

# (3) answer here : https://stackoverflow.com/a/16973510/566035, order not preserved
a = np.array(data)
b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
_, idx = np.unique(b, return_index=True)

a[idx]

    array([['1', 'z'],
           ['a', 'z'],
           ['c', 'z'],
           ['e', 'z']], 
          dtype='|S1')

Saish · Accepted Answer · 2014-09-03T22:54:09.750

2

This isn't great in terms of efficiency, but is very simple readable code and can work for smaller lists:

sorted(set(data), key=data.index)

edited Sep 03 '14 at 22:54

answered Sep 03 '14 at 18:03

Saish

511
3
11

Wow. This is nice, too. Thanks! – otterb Sep 03 '14 at 18:21
It was hard to pick the answer but I think I like this better. Thanks everyone! – otterb Sep 03 '14 at 18:57
Wow - the overhead of a sort and index operation on this is phenomenal... I wouldn't say it "isn't great in terms of efficiency" - I'd say it's really, really poor :( – Jon Clements Sep 03 '14 at 23:05

score 1 · Answer 2 · answered Sep 03 '14 at 17:40

1

Oops! I found an answer myself...

seen = set()
np.array([x for x in data if x not in seen and not seen.add(x)])

# output
array([['a', 'z'],
       ['1', 'z'],
       ['e', 'z'],
       ['c', 'z']], 
      dtype='|S1')

answered Sep 03 '14 at 17:40

otterb

2,660
2
29
48

I never would have thought to use `and not` to force a call that returns `None` like that. Sneaky! – Mark Ransom Sep 03 '14 at 18:31
Yes, tricky. This is something I found somewhere in stackoverflow a long ago. But i cannot recall where. – otterb Sep 03 '14 at 18:51

score 0 · Answer 3 · answered Apr 13 '22 at 13:37

From my testing a list comprehension with a set to check uniqueness would improve the runtime by 4x (or from O(n^2) to O(n) complexity)

import functools
import time
import string
import random
# initializing list
data = [
    (random.choice(string.ascii_lowercase), 
    random.choice(string.ascii_lowercase)) 
    for _ in range(10_000)
]

# using reduce to isolate uniques while keeping order
start_time = time.time()
control_set = set()
result1 = functools.reduce(lambda a, b: control_set.add(b) or a+[b] if b not in control_set else a, data, [])
print("--- %s seconds ---" % (time.time() - start_time))
"--- 0.002000570297241211 seconds ---"

# creating a set and ordering them by original index
start_time = time.time()
control_set = set()
result2 = sorted(set(data), key=data.index)
print("--- %s seconds ---" % (time.time() - start_time))
"--- 0.00800013542175293 seconds ---"

# list comprehension with set-membership
start_time = time.time()
control_set = set()
result3 = [
    data_element 
    for data_element in data 
    if data_element not in control_set
    and (control_set.add(data_element) or True)
]
print("--- %s seconds ---" % (time.time() - start_time))
"--- 0.0010018348693847656 seconds ---"

def get_unique(rec_list):
    if len(rec_list) == 1:
        return rec_list
    l = get_unique(rec_list[:len(rec_list)//2])
    r = get_unique(rec_list[len(rec_list)//2:])

    set_l = set(l)
    set_r = set(r)
    set_r -= (set_l.intersection(set_r))
    return sorted(set_l, key=l.index) + sorted(set_r, key=r.index)

start_time = time.time()
result4 = get_unique(data)
print("--- %s seconds ---" % (time.time() - start_time))
"--- 0.11902070045471191 seconds ---"

assert result1 == result2 == result3 == result4, "one of them failed"

How to remove duplicates from a list of tuples but keeping the original order

3 Answers3