Python remove partial duplicates from a list

Question

I have a list of items that was improperly created. Instead of copying the whole item once, it made multiple partial copies of the same item. The partial duplicates are mixed with other duplicates and some unique items. For example list a:

a = ['one two','one two three four','one two three','five six','five six seven','eight nine']

I want to remove the partial duplicates and keep the longest expression of the item. For example I would like to produce list b:

b = ['one two three four', 'five six seven','eight nine']

The integrity of the item must remain intact, cannot become:

c = '[two one three four', 'vife six seven', 'eight nine']

What's the expected behavior on `['12', '13', '1']`? What about `['12', '2']`? What about `['123', '2']`? — Brian McCutchon, Nov 11 '20 at 02:28
This problem is poorly constrained. Are the items to be discarded always located directly before? — wim, Nov 11 '20 at 02:31
Are the partial duplicates always going to appear before (or at least, adjacent) to the things they duplicate? Or could `'one two'` appear at the very end of your input and it should still be eliminated? — ShadowRanger, Nov 11 '20 at 02:32
I just updated the list. Longest item is not always after the shortest. the order is random. — Mario Tomas, Nov 11 '20 at 02:55
If that's the case, just sort them, and then my answer should work. — Dennis, Nov 11 '20 at 03:03

Dennis · Answer 1 · 2020-11-11T03:13:50.687

3

Try this:

def group_partials(strings):
    it = iter(sorted(strings))
    prev = next(it)
    for s in it:
        if not s.startswith(prev):
            yield prev
        prev = s
    yield s

a = ['one two','one two three', 'one two three four', 'five six', 'five six seven', 'eight nine']
b = list(group_partials(a))

edited Nov 11 '20 at 03:13

answered Nov 11 '20 at 02:28

Dennis

2,249
6
12

Mike67 · Accepted Answer · 2020-11-11T04:50:23.277

3

You can use sets for this.

Try this code

a = ['one two','one two three', 'one two three four', 'five six', 'five six seven','eight nine']

# check for subsets
for i in range(len(a)):
   for j in range(len(a)):
      if i==j: continue # same index
      if (set(a[i].split()) & set(a[j].split())) == set(a[i].split()): # if subset
         a[i]="" # clear string

# a = [x for x in a if len(x)]  # remove empty strings

b = []
for x in a:  # each string in a
   if len(x) > 0: # if not empty
      b.append(x)  # add to final list  

a = b

print(a)

Output

['one two three four', 'five six seven', 'eight nine']

edited Nov 11 '20 at 04:50

answered Nov 11 '20 at 02:32

Mike67

11,175
2
7
15

When updating a list while iterating it, use the index so the memory address stays the same. If I use `for i in a`, then setting `i='abc'` won't update the list, it will just a create a local variable. – Mike67 Nov 11 '20 at 03:42
wouldnt always set(a[i].split()) == set(a[i].split()) . What does the & do, join them? – Mario Tomas Nov 11 '20 at 04:09
I added parenthesis in the answer to make it clearer. Check out python set operations: https://www.geeksforgeeks.org/python-set-operations-union-intersection-difference-symmetric-difference/ – Mike67 Nov 11 '20 at 04:13
"x for x in a if len(x)" Can you simplify this logic for me to understand, or link me something I can read. many thanks – Mario Tomas Nov 11 '20 at 04:44
Answer updated with `for` loop. The previous code used list comprehension: https://jakevdp.github.io/WhirlwindTourOfPython/11-list-comprehensions.html – Mike67 Nov 11 '20 at 04:56
I think you can do `set(a[i].split()).issubset(set(a[j].split()))`. – Brian McCutchon Nov 11 '20 at 06:05
It's also worth noting that this is rather slow (quadratic time). Dennis' answer (adding sorting) would be much faster on large lists (O(n log n)). – Brian McCutchon Nov 11 '20 at 06:08

Python remove partial duplicates from a list

2 Answers2

Linked