13

I have a list of strings such as:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

I want this outcome (and this is the only acceptable outcome):

myList = ["paper", "Plastic", "aluminum", "tin", "glass", "Polypropylene Plastic"]

Note that if an item ("Polypropylene Plastic") happens to contain another item ("Plastic"), I would still like to retain both items. In other words, the case can differ, but an item must otherwise be a letter-for-letter match for it to be removed as a duplicate.

The original list order must be retained. All duplicates after the first instance of that item should be removed. The original case of that first instance should be preserved, as well as the original cases of all non-duplicate items.

I've searched and only found questions that address one need or the other, not both.

Crickets
  • Do you also need to maintain the cases of the letters in the first item? If not, this boils down to [How do you remove duplicates from a list in whilst preserving order?](https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-in-whilst-preserving-order) and for Python 3.7 and above you can do `list(dict.fromkeys([item.casefold() for item in myList]))` (as shown in this [answer](https://stackoverflow.com/a/39835527/2285236)) – ayhan Jan 16 '18 at 14:36
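To make the trade-off in the comment above concrete, here is a quick sketch of what that one-liner returns for the question's list, assuming Python 3.7+ so that dict preserves insertion order. The name folded is only illustrative. Duplicates are dropped and order is kept, but every item comes back case-folded, so the original casing is lost:

```python
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

# dict.fromkeys keeps the first occurrence of each key, in insertion order
folded = list(dict.fromkeys(item.casefold() for item in myList))
print(folded)
# ['paper', 'plastic', 'aluminum', 'tin', 'glass', 'polypropylene plastic']
```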

6 Answers

26

It's difficult to code that with a list comprehension (at least without sacrificing clarity), because filtering out duplicates requires remembering which items have already been seen.

It's also not possible to use a set comprehension because it destroys the original order.

The classic way is a loop with an auxiliary set in which you store the lowercase version of each string you encounter. Append the string to the result list only if its lowercased version isn't already in the set:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
result = []

marker = set()

for item in myList:
    key = item.lower()
    if key not in marker:   # test presence
        marker.add(key)
        result.append(item)   # preserve order

print(result)

result:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

Using .casefold() instead of .lower() also handles subtle casing differences that .lower() misses, such as the German "ß" in Strasse/Straße, which casefold() maps to "ss".
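As a rough illustration of that difference, the sketch below wraps the same loop in a hypothetical helper, dedupe, with a pluggable key function (the helper and its key parameter are assumptions for this example, not part of the original answer):

```python
# Hypothetical helper: the same set-based loop, with the normalization passed in.
def dedupe(items, key=str.lower):
    seen = set()
    result = []
    for item in items:
        k = key(item)
        if k not in seen:      # first time this normalized form appears
            seen.add(k)
            result.append(item)
    return result

words = ["Straße", "STRASSE", "strasse"]
by_lower = dedupe(words)                       # .lower() leaves "ß" as is
by_casefold = dedupe(words, key=str.casefold)  # .casefold() turns "ß" into "ss"
print(by_lower)     # ['Straße', 'STRASSE']
print(by_casefold)  # ['Straße']
```

With .lower(), "Straße" and "STRASSE" are treated as different items; with .casefold(), all three collapse to the first one.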

Edit: it is possible to do that with a list comprehension, but it's really hacky:

marker = set()
result = [not marker.add(x.casefold()) and x for x in myList if x.casefold() not in marker]

It relies on set.add returning None: not marker.add(...) is therefore always True, so the and expression both calls the function (a side effect in a list comprehension, rarely a good thing...) and evaluates to x no matter what. The main disadvantages are:

  • readability
  • the fact that casefold() is called twice, once for testing, once for storing in the marker set
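If Python 3.8+ is available, the second disadvantage can be avoided with an assignment expression. This is only a sketch of a variant, not part of the original answer, and it is arguably no more readable; the name key is illustrative:

```python
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

marker = set()
# (key := x.casefold()) computes the folded form once per item (Python 3.8+).
# marker.add(key) returns None, so "not marker.add(key)" is always True
# and is only there to record the key as a side effect.
result = [x for x in myList
          if (key := x.casefold()) not in marker and not marker.add(key)]

print(result)
# ['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
```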
Jean-François Fabre
  • great answer as it preserves original case – APorter1031 Jan 16 '18 at 14:24
  • I found it not difficult to do this with a list comprehension. People tend to forget that these can contain function calls with side effects. – Alfe Jan 16 '18 at 14:47
  • @Alfe People generally avoid those side effects. – ayhan Jan 16 '18 at 14:49
  • @Alfe OP changed input data. My answer is still valid, as it tests exact strings, not substrings. I edited to match the new inputs. See my listcomp attempt. Don't do that at home :) – Jean-François Fabre Jan 16 '18 at 14:56
  • @ayhan You should be *aware* of side effects in general. *Avoiding* them in general deprives you of several advantages they offer. – Alfe Jan 16 '18 at 15:12
  • @Jean-FrançoisFabre I see (about the change in the Q) (and removed my comment about that). Your list comprehension approach really is hacky :) I think mine is okay though. Sure, it uses side effects, but in an acceptable manner. – Alfe Jan 16 '18 at 15:16
3
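import pandas as pd

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
df = pd.DataFrame(myList)
df['lower'] = df[0].str.lower()  # case-insensitive grouping key
# sort=False keeps the groups in order of first appearance
print(df.groupby('lower', sort=False)[0].first().tolist())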

output:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
Binyamin Even
0

EDIT: Okay, I edited my answer as the question changed in the meantime. Now it checks if the capitalized word is found in the original list and converts it to lowercase when not found.

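import string

def custom_filter(my_list):
    seen = set()
    result_list = []
    for i in my_list:
        # capitalize each word, e.g. "PAPer" -> "Paper"
        item = string.capwords(i)
        # fall back to lowercase unless the capitalized form occurs in the input
        if item not in my_list:
            item = item.lower()
        if item not in seen:
            result_list.append(item)
            seen.add(item)
    return result_list


myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
print(custom_filter(myList))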

Outputs:

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
Gábor Fekete
0
mydict = {}
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
mynewList = []
for elem in myList:
    if elem.lower() in mydict:
        continue
    else:
        mydict[elem.lower()] = elem.lower()
        mynewList.append(elem)
print(mynewList)

result: ['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']

Basically the same as the first answer by @Jean-François Fabre, but using a dictionary.

Fenil
  • What is the benefit of using a `dict` instead of a `set`? Having the key and value be identical seems pointless. – Mark Ransom Feb 13 '22 at 18:07
0

Another way, using collections.defaultdict:

from collections import defaultdict

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]
d_dict = defaultdict(list)
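for index, item in enumerate(myList):
    d_dict[item.lower()].append(index)  # indices of every occurrence, grouped by lowercase key

# take the first index of each group, restore the original order, and print
print([myList[j] for j in sorted(i[0] for i in d_dict.values())])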

Output

['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
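Since only the first occurrence of each key is needed, a similar idea can be sketched with a plain dict and setdefault, assuming Python 3.7+ where dicts preserve insertion order. This is a variant for illustration, not the answer's own code; first_seen is a made-up name:

```python
myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass", "tin", "PAPER", "Polypropylene Plastic"]

first_seen = {}
for item in myList:
    # setdefault stores the item only if its lowercase key is not present yet,
    # so later duplicates never overwrite the first spelling
    first_seen.setdefault(item.lower(), item)

result = list(first_seen.values())
print(result)
# ['paper', 'Plastic', 'aluminum', 'tin', 'glass', 'Polypropylene Plastic']
```

No index bookkeeping or sorting is needed here, because the dict's insertion order already matches the order of first appearance.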
Transhuman
-1

I find the answer of @Gábor Fekete quite good. Here is a continuation of his approach:

myList = ["paper", "Plastic", "aluminum", "PAPer", "tin", "glass",
          "tin", "PAPER", "Polypropylene Plastic"]

def is_already_in(value, used_elements):
    # returns True for repeats; records first-time values as a side effect
    low = value.lower()
    if low in used_elements:
        return True
    used_elements.add(low)
    return False

used_elements = set()
print([e for e in myList if not is_already_in(e, used_elements)])
Alfe