How to remove duplicates from list, whilst also adding duplicate-count to respective list items?

Question

My question is the same as

How to remove case-insensitive duplicates from a list, while maintaining the original list order?

except that I would also like the duplicate items to reflect the number of duplicates in the item string itself (in parentheses).

Example input:

myList = ["paper", "Plastic", "aluminum", "PAPer", "TIN", " paper", "glass", "tin", "PAPER", "Polypropylene Plastic"]

The only acceptable output:

myList = ["paper (3)", "Plastic", "aluminum", "TIN (2)", " paper", "glass", "Polypropylene Plastic"]

Notes:

Note that if an item ("Polypropylene Plastic") happens to contain another item ("Plastic"), I would still like to retain both items.
So, the cases can differ, but the item must be a character-for-character match, for it to be removed.
The original list order must be retained.
All duplicates after the first instance of that item should be removed. The original case of that first instance should be preserved, as well as the original cases of all non-duplicate items.

I am looking for the fastest method to accomplish this in Python 2.7.

@ReblochonMasque There's no real problem with my code. It's just slow and unsophisticated. All it does is 1) get tallies of duplicate-items, 2) removes duplicates from the list, and 3) modifies each duplicated-item string with the previously-gotten duplicate-count in 3 separate `for` loops. I didn't post my clunky code because I can't imagine that a SO member would want to use my code as any sort of starting point. — Crickets, Jul 03 '18 at 05:38
@Crickets Posting your code does much more than give us a "starting point". It also shows that you have devoted some effort to the problem as well as giving us a point of comparison for the expected behaviour in an executable form which works much better than a description of what you are trying to achieve. — chthonicdaemon, Jul 03 '18 at 05:45

abarnert · Answer 1 · 2018-07-03T05:20:39.390

In the original question, you presumably (I just glanced at it) used a set of the casefolded strings to see if you had a new one or a repeat, building a list of new ones as you go along.

You can replace this with a Counter instead of a set. But then you need to build the list and then go back and edit it with the counts.

So instead, replace both the set/Counter and the output list with an OrderedDict that stores item-count pairs for each case-folded item:

d = collections.OrderedDict()
for item in myList:
    caseless = item.lower()
    try:
        d[caseless][1] += 1
    except KeyError:
        d[caseless] = [item, 1]

… and then do a pass over that dict to generate the output list:

myList = []
for item, count in d.values():
    if count > 1:
        item = '{} ({})'.format(item, count)
    myList.append(item)

You can make this more concise (e.g., myList = ['{} ({})'.format(item, count) if count > 1 else item for item, count in d.values()), and that will also make it a bit faster by a small constant factor.

You can probably shave off a few nanoseconds by using % instead of format too, and possibly even more with %d instead of %s (although I think that last part no longer true even by 2.7).

Depending on your platform, a[0] += 1 may be faster or slower than a[1] += 1. So try it both ways, and if a[0] is faster, use [count, item] pairs instead of [item, count]. If you have a ton of dups, you may want to consider a class with __slots__, which can actually be slightly faster to update, but significantly slower to create, than a list.

Also, using an in test, or maybe storing d.__contains__ as a local, may be faster than try—or it may be slower, depending on how many repeats you expect to have, so try it all three ways on your actual data rather than a toy dataset.

RoadRunner · Answer 2 · 2018-07-03T06:09:14.327

1

You could also try using a collections.Counter() object to keep track of the counts, and use it to keep track of what words have been seen, using caseless words as reference. Then when you are finished iterating over the input list, update the result list to have the word counts in the form %s (%d), if the count is greater than 1.

Code:

from collections import Counter

words = ["paper", "Plastic", "aluminum", "PAPer", "TIN", " paper", "glass", "tin", "PAPER", "Polypropylene Plastic"]

counts = Counter()
result = []

for word in words:
    caseless = word.casefold()

    if caseless not in counts:
        result.append(word)

    counts[caseless] += 1

result = ['%s (%d)' % (w, counts[w.casefold()]) if counts[w.casefold()] > 1 
                                                else w for w in result]

print(result)

Output:

['paper (3)', 'Plastic', 'aluminum', 'TIN (2)', ' paper', 'glass', 'Polypropylene Plastic']

edited Jul 03 '18 at 06:09

answered Jul 03 '18 at 05:26

RoadRunner

25,803
6
42
75

You can get rid of the set by using the Counter to keep track of what you've seen. You also traverse the word list three times in this solution, while you only need two. – chthonicdaemon Jul 03 '18 at 05:58
@chthonicdaemon Updated. – RoadRunner Jul 03 '18 at 06:09
It seems that Python 2.7 does not support `casefold`. – Crickets Jul 03 '18 at 18:38
@Crickets use `str.lower()` instead. – RoadRunner Jul 04 '18 at 00:15

chthonicdaemon · Accepted Answer · 2018-07-04T05:25:10.627

Here is a version using a single Counter, avoiding the use of another set as in @RoadRunner's solution by popping keys from the Counter as we pass them. This may be slightly slower than the OrderedDict solution if there are many duplicates, but will use less memory:

from collections import Counter

words = ["paper", "Plastic", "aluminum", "PAPer", "TIN", " paper", "glass", "tin", "PAPER", "Polypropylene Plastic"]

counter = Counter(w.lower() for w in words)

result = []
for word in words:
    key = word.lower()
    if key in counter:
        count = counter[key]
        if count == 1:
            result.append(word)
        else:
            result.append('{} ({})'.format(word, count))
        counter.pop(key)

Note You should use casefold instead of lower for Python >= 3.3

It seems that Python 2.7 does not support `casefold`. – Crickets Jul 03 '18 at 18:38 — Crickets, Jul 03 '18 at 18:38
@Crickets use str.lower () instead – RoadRunner Jul 03 '18 at 23:52 — RoadRunner, Jul 03 '18 at 23:52

How to remove duplicates from list, whilst also adding duplicate-count to respective list items?

3 Answers3