delete duplicates from list of tuples

Question

I have a list of tuples which unfortunately contain duplicates, like so:

[(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (64, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (61, u'top-coldestcitiesinamerica'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion'), (43, u'x_file')]

The problem is that the second element of each tuple (index 1 with 0-based indexing) is the entry I want to check for duplicates. So, I can see that:

(67, u'top-coldestcitiesinamerica')
(61, u'top-coldestcitiesinamerica')

...are duplicates, and I would like to delete one of them (similar to a set). So, at the end, I'd like to have a clean list of tuples with no duplicates (i.e. no duplicates in the second element of the tuple), like so:

[(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion')]

How can I achieve this in a pythonic way? Thanks!

score 5 · Accepted Answer · edited May 23 '17 at 12:22

You could use the set approach from "How do you remove duplicates from a list whilst preserving order?", using x[1] as the unique identifier:

def unique_second_element(seq):
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x[1] in seen or seen_add(x[1]))]

Note that the OrderedDict approach shown there would also work if you wanted to keep the last occurrence of each duplicate; to keep the first occurrence you'd have to reverse the input, then reverse the output again.
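
As a rough sketch of that OrderedDict idea (the function names `unique_keep_last` and `unique_keep_first` are made up for illustration and do not come from the linked question), something like the following should work. Note that the resulting order differs slightly from the set approach, as the comments explain:

```python
from collections import OrderedDict


def unique_keep_last(seq, key):
    # A later item overwrites the value stored under the same key,
    # but the key keeps the position where it was first inserted.
    return list(OrderedDict((key(x), x) for x in seq).values())


def unique_keep_first(seq, key):
    # Reverse the input, dedupe, then reverse the output again.
    # This keeps the first occurrence of each key, but positions
    # follow where each key appeared last in the original list.
    deduped = OrderedDict((key(x), x) for x in reversed(seq))
    return list(reversed(list(deduped.values())))


sample = [(1, 'a'), (2, 'b'), (3, 'a')]
print(unique_keep_last(sample, key=lambda t: t[1]))   # [(3, 'a'), (2, 'b')]
print(unique_keep_first(sample, key=lambda t: t[1]))  # [(2, 'b'), (1, 'a')]
```

If you need first occurrences in their original positions, the set-based function above is the simpler choice.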

You could make this even more generic by supporting a key function:

def unique_preserve_order(seq, key=None):
    if key is None:
        key = lambda elem: elem
    seen = set()
    seen_add = seen.add
    augmented = ((key(x), x) for x in seq)
    return [x for k, x in augmented if not (k in seen or seen_add(k))]

then use

import operator

unique_preserve_order(yourlist, key=operator.itemgetter(1))

Demo:

>>> def unique_preserve_order(seq, key=None):
...     if key is None:
...         key = lambda elem: elem
...     seen = set()
...     seen_add = seen.add
...     augmented = ((key(x), x) for x in seq)
...     return [x for k, x in augmented if not (k in seen or seen_add(k))]
... 
>>> from pprint import pprint
>>> import operator
>>> yourlist = [(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (64, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (61, u'top-coldestcitiesinamerica'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion'), (43, u'x_file')]
>>> pprint(unique_preserve_order(yourlist, operator.itemgetter(1)))
[(67, u'top-coldestcitiesinamerica'),
 (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'),
 (65, u'a-b-c-ca-d-ab-ea-d-c-c'),
 (63,
  u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'),
 (62, u'ghgemissions'),
 (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'),
 (57, u'culture'),
 (55, u'cas-k-ihaveanidea'),
 (54, u'trendsfor'),
 (53, u'batteryimpedance'),
 (52, u'evs-howey-full'),
 (51, u'bericht'),
 (49, u'classiccarinsurance'),
 (47, u'uploaded_file'),
 (46, u'x_file'),
 (45, u's-s-main'),
 (44, u'vehicle-propulsion')]
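
Because the key is just a callable, the same function can be reused for other notions of "duplicate". The snippet below is only an illustration: the `records` data and the case-insensitive key are assumptions, not part of the question.

```python
def unique_preserve_order(seq, key=None):
    # Same function as above, repeated so the snippet runs on its own.
    if key is None:
        key = lambda elem: elem
    seen = set()
    seen_add = seen.add
    augmented = ((key(x), x) for x in seq)
    return [x for k, x in augmented if not (k in seen or seen_add(k))]


# With no key, each element is its own identifier.
print(unique_preserve_order([3, 1, 3, 2, 1]))  # [3, 1, 2]

# Hypothetical data: treat names that differ only in case as duplicates.
records = [(1, 'Culture'), (2, 'culture'), (3, 'bericht')]
print(unique_preserve_order(records, key=lambda rec: rec[1].lower()))
# [(1, 'Culture'), (3, 'bericht')]
```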

Sorry about the delay in responding - I ended up using your `unique_second_element` method - works like a charm. Thank you very much! — AJW, Mar 16 '15 at 09:36

score 1 · Answer 2 · answered Mar 03 '15 at 14:19

As an alternative, you can use itertools.groupby() on the sorted list (`l` below is the list from the question). This could be helpful if you have a huge list, but it is not as good as the set approach:

>>> from itertools import groupby
>>> from operator import itemgetter
>>> [next(g) for _, g in groupby(sorted(l, key=itemgetter(1)), itemgetter(1))]
[(65, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (53, u'batteryimpedance'), (51, u'bericht'), (55, u'cas-k-ihaveanidea'), (49, u'classiccarinsurance'), (57, u'culture'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (52, u'evs-howey-full'), (62, u'ghgemissions'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (45, u's-s-main'), (67, u'top-coldestcitiesinamerica'), (54, u'trendsfor'), (47, u'uploaded_file'), (44, u'vehicle-propulsion'), (46, u'x_file')]

This kills the ordering, and the sort makes this an O(N log N) solution, vs. my O(N) approach. — Martijn Pieters, Mar 03 '15 at 14:23
@MartijnPieters Unfortunately, yes! But maybe that doesn't matter for the OP, and I have mentioned that `set` is a better recipe! — Mazdak, Mar 03 '15 at 14:24
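
If the original order does matter, one possible workaround (a sketch only; `unique_groupby_ordered` is a made-up name) is to carry each item's index through the sort and restore the order afterwards. It stays O(N log N), so the set approach is still the better fit here:

```python
from itertools import groupby
from operator import itemgetter


def unique_groupby_ordered(seq, key):
    # Pair every item with its original index, then sort by key.
    # sorted() is stable, so items with equal keys stay in input order.
    indexed = sorted(enumerate(seq), key=lambda pair: key(pair[1]))
    # The first member of each group is the earliest occurrence of that key.
    firsts = [next(group)
              for _, group in groupby(indexed, key=lambda pair: key(pair[1]))]
    # Sort the survivors back into their original positions.
    return [item for _, item in sorted(firsts, key=itemgetter(0))]


sample = [(1, 'b'), (2, 'a'), (3, 'b')]
print(unique_groupby_ordered(sample, key=itemgetter(1)))  # [(1, 'b'), (2, 'a')]
```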

Answer 3 · Vivek Sable · answered Mar 03 '15 at 14:35

Define a check_list set to hold the keys seen so far.
Iterate over every item in the input list.
Check whether the item's key (its second element) is already in check_list.
If it is not, append the item to the result list and add the key to check_list.
Print the result.

code:

input_list = [(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (64, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (61, u'top-coldestcitiesinamerica'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion'), (43, u'x_file')]

check_list = set()
result = []
for i in input_list:
    # i[1] is the key; keep only the first tuple seen for each key
    if i[1] not in check_list:
        result.append(i)
        check_list.add(i[1])

import pprint
pprint.pprint(result)

Output:

$ python task4.py 
[(67, u'top-coldestcitiesinamerica'),
 (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'),
 (65, u'a-b-c-ca-d-ab-ea-d-c-c'),
 (63,
  u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'),
 (62, u'ghgemissions'),
 (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'),
 (57, u'culture'),
 (55, u'cas-k-ihaveanidea'),
 (54, u'trendsfor'),
 (53, u'batteryimpedance'),
 (52, u'evs-howey-full'),
 (51, u'bericht'),
 (49, u'classiccarinsurance'),
 (47, u'uploaded_file'),
 (46, u'x_file'),
 (45, u's-s-main'),
 (44, u'vehicle-propulsion')]

@MartijnPieters: Apologies, I have switched to a set. — Vivek Sable, Mar 03 '15 at 14:53

score 0 · Answer 4 · answered Mar 03 '15 at 14:27

I did it a very plain and easy way.

lst=[(67, u'top-coldestcitiesinamerica'), (66, u'ecofriendlyideastocelebrateindependenceday-phpapp'), (65, u'a-b-c-ca-d-ab-ea-d-c-c'), (64, u'a-b-c-ca-d-ab-ea-d-c-c'), (63, u'alexandre-meybeck-faowhatisclimate-smartagriculture-backgroundopportunitiesandchallenges'), (62, u'ghgemissions'), (61, u'top-coldestcitiesinamerica'), (58, u'infographicthe-stateofdigitaltransformationaltimetergroup'), (57, u'culture'), (55, u'cas-k-ihaveanidea'), (54, u'trendsfor'), (53, u'batteryimpedance'), (52, u'evs-howey-full'), (51, u'bericht'), (49, u'classiccarinsurance'), (47, u'uploaded_file'), (46, u'x_file'), (45, u's-s-main'), (44, u'vehicle-propulsion'), (43, u'x_file')]

lst2 = [] #empty list to fill with unique tuples
lst_banned = [] #empty list to fill with banned elements

for tup in lst:
    if tup[-1] not in lst_banned:
        lst_banned.append(tup[-1])
        lst2.append(tup)

lst=lst2
del lst2
del lst_banned

I just see that there was a similar answer posted while I wrote this. Sorry! :) — Robin Kastner, Mar 03 '15 at 14:29
Same comment for you: using a list to track unique elements is **slow** as each test takes up to `len(lst_banned)` steps. A set lets you test for membership in *constant time*. — Martijn Pieters, Mar 03 '15 at 14:29
Good point! 'set' is more pythonic... I think, that was the point of the question, too! — Robin Kastner, Mar 03 '15 at 14:33
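
To make the cost difference from the comments above concrete, here is a small sketch that counts equality comparisons rather than measuring time. `CountingKey` and the two helper functions are hypothetical names introduced only for this illustration, and the set count assumes CPython, where `==` is only called on entries whose hashes match:

```python
class CountingKey:
    """Hypothetical key wrapper that counts how often == is evaluated."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return self.value == other.value

    def __hash__(self):
        return hash(self.value)


def comparisons_with_list(n):
    # Track seen keys in a list: every membership test scans the list.
    CountingKey.comparisons = 0
    seen = []
    for i in range(n):
        k = CountingKey(i)
        if k not in seen:
            seen.append(k)
    return CountingKey.comparisons


def comparisons_with_set(n):
    # Track seen keys in a set: membership is decided via the hash.
    CountingKey.comparisons = 0
    seen = set()
    for i in range(n):
        k = CountingKey(i)
        if k not in seen:
            seen.add(k)
    return CountingKey.comparisons


print(comparisons_with_list(100))  # 4950, i.e. 100 * 99 / 2
print(comparisons_with_set(100))
```

With 100 distinct keys the list version needs 4950 comparisons, growing quadratically, while the set version needs almost none.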
