Deduping a complex list using a simplified copy of itself

Question

I have two lists of strings that are passed into a function. They are more or less the same, except that one has been run through a regex filter to remove certain boilerplate substrings (e.g. removing 'LLC' from 'Blues Brothers LLC').

This function is meant to internally deduplicate the modified list and remove the associated item in the non-modified list. You can assume that these lists were sorted alphabetically before being run through the regex filter, and remain in the same order (i.e. original[x] and modified[x] refer to the same entity, even if original[x] != modified[x]). Relative order must be maintained between the two lists in the output.

This is what I have so far. It works 99% of the time, except for very rare combinations of inputs and boilerplate strings (1 in 1000s) where some output strings will be mismatched by a single list position. Input lists are 'original' and 'modified'.

# record positions of duplicates so we're not trying to modify the same lists we're iterating
dellist_modified = []
dellist_original = []

# probably not necessary, extra precaution against modifying lists being iterated.
# fwiw the problem still exists if I remove these and change their references in the last two lines directly to the input lists
modified_copy = modified
original_copy = original

for i in range(0, len(modified)-1):
    if modified[i] == modified[i+1]:
        dellist_modified.append(modified[i+1])
        dellist_original.append(original[i+1])

for j in dellist_modified:
    if j in modified:
        del modified_copy[agg_match.index(j)]
        del original_copy[agg_match.index(j)]

# return modified_copy and original_copy

It's ugly, but it's all I got. My testing indicates the problem is created by the last chunk of code.

Modifications or entirely new approaches would be greatly appreciated. My next step is to try using dictionaries.

When you say ``modified_copy = modified`` all you are doing is making another reference to the existing list, not copying it. To actually copy it, you can do [``modified_copy = list(modified)``](http://henry.precheur.org/python/copy_list), [``modified_copy = modified[:]``](http://stackoverflow.com/questions/323689/python-list-slice-syntax-used-for-no-obvious-reason) or use [the ``copy`` module](http://docs.python.org/library/copy.html). — Gareth Latty, Apr 25 '12 at 20:48
What is the actual test you want to do, I can't quite get this clear - if the item in the modified list is elsewhere in the modified list it should be removed from both lists? — Gareth Latty, Apr 25 '12 at 20:56
I'm trying your suggestion right now, will update when it's finished running. — acpigeon, Apr 25 '12 at 20:59
But you want to retain one of the items in the modified list (`[1, 1, 2]` becomes `[1, 2]` not `[2]`)? — Gareth Latty, Apr 25 '12 at 21:03
Right, with the element of the same address in the unmodified list being removed as well. — acpigeon, Apr 25 '12 at 21:05
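
To illustrate the aliasing point raised in the comments above, here is a minimal sketch. The sample data and the names `alias` and `shallow_copy` are made up for illustration and do not come from the question.

```python
# Hypothetical sample data, not taken from the question.
modified = ["Blues Brothers", "Blues Brothers", "Acme"]

alias = modified               # a second name for the same list object
shallow_copy = list(modified)  # a new list holding the same elements

del alias[1]                   # this also changes `modified`: both names share one list

print(modified)                # ['Blues Brothers', 'Acme']
print(shallow_copy)            # ['Blues Brothers', 'Blues Brothers', 'Acme']
print(alias is modified)       # True
print(shallow_copy is modified)  # False
```

With plain assignment, deleting through the "copy" shrinks the list that is also being searched, which is one plausible way for positions to drift by one.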

Gareth Latty · Accepted Answer · 2012-04-25T21:18:14.927

2

Here is a clean way of doing this:

original = list(range(10))
modified = list(original)
modified[5] = "a"
modified[6] = "a"

def without_repeated(original, modified):
    seen = set()
    for (o, m) in zip(original, modified):
        if m not in seen:
            seen.add(m)
            yield o, m

original, modified = zip(*without_repeated(original, modified))

print(original)
print(modified)

Giving us:

(0, 1, 2, 3, 4, 5, 7, 8, 9)
(0, 1, 2, 3, 4, 'a', 7, 8, 9)

We iterate through both lists at the same time. We keep a set of the modified values we have already seen (sets have very fast membership checks) and yield only the pairs whose modified value has not been seen before.

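We can then use zip again to split the pairs back into two sequences. These come back as tuples, as the output above shows; wrap each in list() if you need lists.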
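
If actual lists are needed, or the inputs might be empty (unpacking `zip(*...)` into two names raises `ValueError` when no pairs are produced), a variant along these lines may help. This is only a sketch of the same seen-set idea: the function name `dedupe_pairs` and the sample company names are assumptions, not part of the original answer.

```python
def dedupe_pairs(original, modified):
    """Keep the first pair for each distinct value in `modified`, preserving order."""
    seen = set()
    kept_original, kept_modified = [], []
    for o, m in zip(original, modified):
        if m not in seen:
            seen.add(m)
            kept_original.append(o)
            kept_modified.append(m)
    return kept_original, kept_modified


# Hypothetical sample data in the spirit of the question.
original = ["Acme Inc", "Blues Brothers", "Blues Brothers LLC", "Zed Corp"]
modified = ["Acme", "Blues Brothers", "Blues Brothers", "Zed"]

new_original, new_modified = dedupe_pairs(original, modified)
print(new_original)          # ['Acme Inc', 'Blues Brothers', 'Zed Corp']
print(new_modified)          # ['Acme', 'Blues Brothers', 'Zed']
print(dedupe_pairs([], []))  # ([], [])
```

Note that this drops every later repeat of a modified value, not only repeats that sit next to each other.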

Note we could actually do this like so:

seen = set()
original, modified = zip(*((o, m) for (o, m) in zip(original, modified) if m not in seen and not seen.add(m)))

This works the same way, except that it uses a single generator expression, with adding the item to the set hacked into the condition (`set.add` always returns `None`, which is falsy, so `not seen.add(m)` is always true). However, this is considerably harder to read, so I'd advise against it; it's just an example for the sake of it.
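
The trick in the one-liner relies on the return value of `set.add`. A small sketch of what the condition evaluates to for a new item and for a repeated one (the variable names here are illustrative only):

```python
seen = set()

# First time: "a" is not in the set, so the right-hand side runs and adds it.
# set.add returns None, so `not seen.add("a")` is True.
first = "a" not in seen and not seen.add("a")

# Second time: "a" is already in the set, so `and` short-circuits to False
# and seen.add is never called.
second = "a" not in seen and not seen.add("a")

print(first)   # True
print(second)  # False
print(seen)    # {'a'}
```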

edited Apr 25 '12 at 21:18

answered Apr 25 '12 at 21:10

Gareth Latty


Rad, thanks again! Your suggestion in the comments actually solved my problem based on two tests I just ran, so I'm going to go ahead and accept this as the answer. I will also run more tests with the code you just posted here over the next few days and report back. – acpigeon Apr 25 '12 at 21:28
@acpigeon Yeah, it's a nice elegant way of unzipping stuff. I would argue that my solution has some advantages - generally copying lists is something you want to try and avoid. Naturally, however, go with what works best for you. – Gareth Latty Apr 25 '12 at 21:33
Well, it took me a bit to figure out exactly what you were doing ([this really helped me understand the yield statement](http://stackoverflow.com/questions/231767/the-python-yield-keyword-explained)), but I definitely agree that this is the most elegant way to do what I need. Thanks again. – acpigeon Apr 26 '12 at 18:02

score 0 · Answer 2 · answered Apr 25 '12 at 20:53


A set in Python is a collection of distinct elements. Is the order of these elements critical? If not, something like this may work:

distinct = list(set(original))


g.d.d.c


Yes, because I need the lists to have the same order relative to each other in the output, I wasn't able to make this work. Thanks though! – acpigeon Apr 25 '12 at 20:54
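
Since order matters and the question mentions dictionaries as a next step, one possible order-preserving alternative is sketched below. It assumes Python 3.7 or later, where `dict` keeps insertion order; the function name `dedupe_with_dict` and the sample data are hypothetical.

```python
def dedupe_with_dict(original, modified):
    """Map each modified value to the first original paired with it.

    Assumes Python 3.7+, where dicts preserve insertion order.
    """
    first_seen = {}
    for o, m in zip(original, modified):
        # setdefault only stores `o` if `m` has not been stored yet.
        first_seen.setdefault(m, o)
    return list(first_seen.values()), list(first_seen.keys())


# Hypothetical sample data in the spirit of the question.
original = ["Acme Inc", "Blues Brothers", "Blues Brothers LLC", "Zed Corp"]
modified = ["Acme", "Blues Brothers", "Blues Brothers", "Zed"]

deduped_original, deduped_modified = dedupe_with_dict(original, modified)
print(deduped_original)  # ['Acme Inc', 'Blues Brothers', 'Zed Corp']
print(deduped_modified)  # ['Acme', 'Blues Brothers', 'Zed']
```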

score 0 · Answer 3 · answered Apr 25 '12 at 21:53


Why use parallel lists? Why not a single list of class instances? That keeps things grouped easily, and reduces your list lookups.
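
A rough sketch of what that could look like, assuming Python 3.7+ for `dataclasses`. The class name `Company`, its fields, and the helper `dedupe_records` are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Company:
    original: str  # the name as it came in
    modified: str  # the name after the regex filter


def dedupe_records(records):
    """Keep the first record for each distinct `modified` value, preserving order."""
    seen = set()
    kept = []
    for record in records:
        if record.modified not in seen:
            seen.add(record.modified)
            kept.append(record)
    return kept


# Hypothetical sample data in the spirit of the question.
records = [
    Company("Acme Inc", "Acme"),
    Company("Blues Brothers", "Blues Brothers"),
    Company("Blues Brothers LLC", "Blues Brothers"),
]

kept = dedupe_records(records)
print([r.original for r in kept])  # ['Acme Inc', 'Blues Brothers']
```

Because each original and modified name travel together in one object, there is no second list whose positions can fall out of step.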


user1277476

