How to remove case-insensitive duplicates from a list, while maintaining the original list order?

Question

I have a list of strings such as:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

I want this outcome (and this is the only acceptable outcome):

myList = ["paper", "Plastic", "aluminum", "tin", "glass", "Polypropylene Plastic"]

Note that if an item ("Polypropylene Plastic") happens to contain another item ("Plastic"), I would still like to retain both items. So the case can differ, but an item must otherwise be a letter-for-letter match for it to be removed.

The original list order must be retained. All duplicates after the first instance of that item should be removed. The original case of that first instance should be preserved, as well as the original cases of all non-duplicate items.

I've searched and only found questions that address one need or the other, not both.

Do you also need to maintain the cases of the letters in the first item? If not, this boils down to [How do you remove duplicates from a list in whilst preserving order?](https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order) and for Python 3.7 and above you can do `list(dict.fromkeys([item.casefold() for item in myList]))` (as shown in this [answer](https://stackoverflow.com/a/39835527/2285236)) — ayhan, Jan 16 '18 at 14:36

Jean-François Fabre · Accepted Answer · 2018-01-16T14:54:46.717

26

It's difficult to write this as a list comprehension (at least without sacrificing clarity), because filtering out duplicates requires remembering which items have already been seen.

It's also not possible to use a set comprehension because it destroys the original order.

The classic way is a loop with an auxiliary set that stores the lowercase version of each string encountered. Append the string to the result list only if its lowercased version isn't in the set yet:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
result=[]

marker = set()

for l in myList:
    ll = l.lower()
    if ll not in marker:   # test presence
        marker.add(ll)
        result.append(l)   # preserve order

print(result)

result:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

Using .casefold() instead of .lower() handles subtle casing differences in some languages (like the German sharp s: "Straße".casefold() gives "strasse", so Strasse and Straße compare equal).
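As a minimal sketch of that variant, the loop above can be wrapped in a function that keys on casefold(). The function name dedupe_casefold is illustrative and not part of the original answer:

```python
def dedupe_casefold(items):
    """Drop case-insensitive duplicates, keeping the first spelling seen."""
    seen = set()      # casefolded keys already emitted
    result = []
    for item in items:
        key = item.casefold()
        if key not in seen:
            seen.add(key)
            result.append(item)   # original case and order preserved
    return result

# "Straße".lower() is "straße", but "Straße".casefold() is "strasse",
# so only casefold() treats the two spellings as duplicates.
print(dedupe_casefold(["Straße", "STRASSE", "strasse", "tin", "Tin"]))
# ['Straße', 'tin']
```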

Edit: it is possible to do that with a list comprehension, but it's really hacky:

marker = set()
result = [not marker.add(x.casefold()) and x for x in myList if x.casefold() not in marker]

It negates the None returned by set.add, so the call is made for its side effect (rarely a good thing in a list comprehension...) and always evaluates to True, which lets "and x" return x no matter what. The main disadvantages are:

readability
the fact that casefold() is called twice, once for testing, once for storing in the marker set
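The second drawback can probably be avoided with an assignment expression, which is not part of the original answer. The sketch below assumes Python 3.8 or later, and it is still a side-effect-in-a-comprehension hack:

```python
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

marker = set()
# (key := ...) computes casefold() once per item; set.add() returns None,
# so "not marker.add(key)" is always True and only records the key.
result = [x for x in myList
          if (key := x.casefold()) not in marker and not marker.add(key)]

print(result)
```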

edited Jan 16 '18 at 14:54

answered Jan 16 '18 at 14:22

Jean-François Fabre

great answer as it preserves original case – APorter1031 Jan 16 '18 at 14:24
I found it not difficult to do this with a list comprehension. People tend to forget that these can contain function calls with side effects. – Alfe Jan 16 '18 at 14:47
@Alfe People generally avoid those side effects. – ayhan Jan 16 '18 at 14:49
@Alfe OP changed input data. My answer is still valid, as it tests exact strings, not substrings. I edited to match the new inputs. See my listcomp attempt. Don't do that at home :) – Jean-François Fabre Jan 16 '18 at 14:56
@ayhan You should be *aware* of side effects in general. *Avoiding* them in general deprives you of several advantages they offer. – Alfe Jan 16 '18 at 15:12
@Jean-FrançoisFabre I see (about the change in the Q) (and removed my comment about that). Your list comprehension approach really is hacky :) I think mine is okay though. Sure, it uses side effects, but in an acceptable manner. – Alfe Jan 16 '18 at 15:16

Binyamin Even · Answer 2 · 2018-01-16T14:49:56.923

3

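import pandas as pd

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
df = pd.DataFrame(myList)          # single column named 0
df['lower'] = df[0].str.lower()    # case-insensitive grouping key
# sort=False keeps the groups in order of first appearance
print(df.groupby('lower', sort=False)[0].first().tolist())

output:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']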

edited Jan 16 '18 at 14:49

answered Jan 16 '18 at 14:42

Binyamin Even

Gábor Fekete · Answer 3 · 2018-01-16T14:39:59.893

EDIT: Okay, I edited my answer as the question changed in the meantime. Now it checks if the capitalized word is found in the original list and converts it to lowercase when not found.

import string

def custom_filter(my_list):
    seen = set()
    result_list = []
    for i in my_list:
        item = string.capwords(i)
        if item not in my_list:
            item = item.lower()
        if item not in seen:
            result_list.append(item)
            seen.add(item)
    return result_list


print(custom_filter(myList))

Outputs:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

Fenil · Answer 4 · 2018-01-16T14:54:09.387

0

mydict = {}
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
mynewList = []
for elem in myList:
    key = elem.lower()
    if key in mydict:
        continue  # a case-insensitive duplicate was already kept
    mydict[key] = key
    mynewList.append(elem)
print(mynewList)

result:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

Basically the same as the first answer by @Jean-François Fabre, but using a dictionary.
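A dictionary does more than stand in for a set if it maps each lowercased key to the first original spelling; its values are then the result. A minimal sketch, assuming Python 3.7+ (where dicts preserve insertion order), with the name first_seen chosen for illustration:

```python
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

first_seen = {}
for elem in myList:
    # setdefault() stores elem only if the lowercased key is new,
    # so later duplicates never overwrite the first spelling
    first_seen.setdefault(elem.lower(), elem)

print(list(first_seen.values()))
```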

edited Jan 16 '18 at 14:54

answered Jan 16 '18 at 14:38

Fenil

What is the benefit of using a `dict` instead of a `set`? Having the key and value be identical seems pointless. – Mark Ransom Feb 13 '22 at 18:07

Transhuman · Answer 5 · 2018-01-17T06:35:16.703

0

Another way, using collections.defaultdict:

from collections import defaultdict

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
d_dict = defaultdict(list)
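for index, item in enumerate(myList):
    d_dict[item.lower()].append(index)  # indices of every occurrence, per lowercased key

# keep the first index of each group; sorting restores the original order
print([myList[j] for j in sorted(i[0] for i in d_dict.values())])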

Output

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

edited Jan 17 '18 at 06:35

answered Jan 16 '18 at 14:48

Transhuman

@Crickets - updated the list comprehension to preserve the list order – Transhuman Jan 17 '18 at 06:36

Alfe · Answer 6 · answered Jan 16 '18 at 14:41

-1

I find the answer of @Gábor Fekete quite good. Here is a continuation of his approach:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass",
          "tin", "PAPER", "Polypropylene Plastic"]

def is_already_in(value, used_elements):
  low = value.lower()
  if low in used_elements:
    return True
  used_elements.add(low)
  return False

used_elements = set()
print([ e for e in myList if not is_already_in(e, used_elements) ])

answered Jan 16 '18 at 14:41

Alfe

You could also add the capitalization of the words as I did in my updated answer. – Gábor Fekete Jan 16 '18 at 14:55
Why? I'm implementing case insensitivity by using `lower()`. – Alfe Jan 16 '18 at 15:09
