
I have a list of xy coordinates as lists:

print(xy[0:10])

[[104.44464000013596, 21.900339999891116],
 [9.574480000151937, 0.32839999976022227],
 [9.932610000251373, 0.19092000005798582],
 [9.821009999711748, 0.26556000039374794],
 [9.877130000349268, -0.6701499997226392],
 [149.51198999973872, -28.469329999879562],
 [149.35872999988965, -28.684280000021943],
 [9.859010000211413, -0.03293000041912819],
 [9.38918000035676, -0.9979400000309511],
 [77.35380000007001, 32.926530000359264]]

Shown here are the first 10, but I have ~100,000 coordinate pairs in my list.

I would like to remove all duplicate lists from this list efficiently. As an easier-to-comprehend example, I would like to create a function remove_dupes that produces the following result:

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
b = remove_dupes(a)
print(b)
[[5, 6], [8, 9]]

Please note that order is important to preserve.

However, because I have such a large list, I find using the .count() method and iterating through the list to be too time-consuming. I also tried various tricks with set() and numpy's unique function.

Here is the fastest version I could come up with:

xy = [[x1,y1], [x2,y2], ... [xn, yn]]

from collections import Counter
import numpy as np

def remove_dupes(xy):

    xy = [tuple(p) for p in xy] # Tuple-ize the coordinate pairs so they are hashable

    p_hash = [hash(p) for p in xy] # Hash each coordinate pair to a single value

    counts = Counter(p_hash) # Get counts (dictionary) for each unique hash value

    p_remove = [key for (key, value) in counts.items() if value > 1] # Store keys with count > 1

    p_hash = np.array(p_hash) # Cast from list to numpy array

    remove = np.zeros((len(xy), 1), dtype=bool) # Initialize storage

    for p in p_remove: # Loop through each non-unique hash and set all the indices where it appears to True // Most time-consuming portion
        remove[np.where(p == p_hash)[0]] = True

    xy = np.array(xy) # Cast back to numpy array for indexing

    xy = xy[~remove.flatten(), :]  # Keep only the non-duplicates

    return xy

This takes ~2 seconds for ~100,000 values (and would take longer the more duplicate pairs, triples, etc. there are). What bothers me is that there are functions like numpy.unique() that return counts and indices in fractions of a second, yet I can't figure out how to conform their outputs to solve this problem. I looked through a couple dozen similar Stack Exchange posts, but found nothing that was both efficient and elegant. Does anyone have any suggestions for a much more elegant way of solving this than what I've presented here?
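For reference, one way to conform numpy.unique's outputs to this problem (a sketch, assuming numpy 1.13+ for the axis keyword and exact-valued duplicates): requesting return_inverse and return_counts lets counts[inverse] give the occurrence count of each original row, which then works as an order-preserving mask.

import numpy as np

def remove_dupes_unique(xy):
    # Sketch: drop every coordinate pair that occurs more than once, keeping the original order
    arr = np.asarray(xy)                                  # shape (n, 2)
    _, inverse, counts = np.unique(arr, axis=0,
                                   return_inverse=True,
                                   return_counts=True)
    mask = counts[inverse.ravel()] == 1                   # True where the row occurs exactly once
    return arr[mask]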

EDIT:

I've received two answers that provide the correct result (and preserve order). RafaelC provided a Pandas option, and DYZ provided a Counter option. I am not that familiar with how to properly time things, but I ran both for 100 iterations (on the same data), using time.time(), with the following results:

Pandas: 13.02 sec

Counter: 28.15 sec

I am not sure why the Pandas implementation is faster; one difference is that the Pandas solution returns tuples (which is OK), so I tried the Counter solution without conversion back to lists and it was still 25 seconds.
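As DYZ suggests in the comments below, timeit is a safer way to time this; a minimal comparison might look like the following (remove_dupes_pandas and remove_dupes_counter are hypothetical wrappers around the two answers):

import timeit

# Hypothetical wrappers around the two answers, run on the same xy data
setup = "from __main__ import remove_dupes_pandas, remove_dupes_counter, xy"
print(timeit.timeit("remove_dupes_pandas(xy)", setup=setup, number=100))
print(timeit.timeit("remove_dupes_counter(xy)", setup=setup, number=100))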

Jon
  • Does order matter? Are you concerned about float values that are close but not exactly the same? Is numpy possible? – dawg Sep 21 '18 at 23:56
  • @dawg Ordering does matter. I'm not sure what you mean by "concerned about float values," but I can say that duplicates will be exact to the highest-precision digit. – Jon Sep 22 '18 at 00:00
  • @RafaelC If you could point me to one such duplicate, I'd appreciate it! Like I mentioned, I read through many posts, but I certainly missed something...the example you provided does not apply to my case. I do not want a list of unique lists, I want to remove all duplicate pairs, triples, quadruples, etc. – Jon Sep 22 '18 at 00:01
  • @Jon hm I see your point now. So, just to be clear, `[1,2]` is different than `[2,1]` right? – rafaelc Sep 22 '18 at 00:06
  • @RafaelC Yes, good point. Preserving ordering of the output is important for the overall list, and ordering of the [xy] lists within the overall list is important to preserve as well. – Jon Sep 22 '18 at 00:08
  • You should use module `timeit` or iPython's `%timeit` for timing. All other timing methods are subject to bias and errors. – DYZ Sep 22 '18 at 00:34
  • @Jon *I'm not sure what you mean by "concerned about float values,"*. The issue is that at times floats do not have 100% comparability. Are `.10` and `.1000000000000000001` the same number or not? You can get differences from rounding and other artifacts of limited ability to represent floats. It is why you have [numpy.isclose](https://docs.scipy.org/doc/numpy/reference/generated/numpy.isclose.html) – dawg Sep 22 '18 at 01:32
  • @dawg Gotcha. That's a good reason not to use the hashing like I was trying! – Jon Sep 22 '18 at 01:36
  • *I am not sure why the Pandas implementation is faster* If you are on Python 2, `Counter` was super slow. On Python 3.4 or so plus, `Counter` is super fast. – dawg Sep 22 '18 at 19:48
  • @dawg I'm on Python 3.5. – Jon Sep 23 '18 at 01:55

4 Answers


I'd use pandas

import pandas as pd

s = pd.Series(list(map(tuple, l)))  # l is the list of [x, y] pairs
s[~s.duplicated(keep=False)].tolist()

Takes

211 ms ± 16.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

for 100000 entries, so a 10x speedup.
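On the toy list from the question, this keeps only the pairs that occur exactly once (note that the result is a list of tuples):

import pandas as pd

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
s = pd.Series(list(map(tuple, a)))
print(s[~s.duplicated(keep=False)].tolist())  # [(5, 6), (8, 9)]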

rafaelc
  • The Counter is 32X faster... Pandas are slow beasts. – DYZ Sep 22 '18 at 00:16
  • Very nice, this is exactly why I asked. Fast, preserves ordering. I don't believe ordering is preserved with the Counter solution provided by @DYZ. (It is not in my tests, while this method certainly does.) I will do some more testing to make sure this is what I need before accepting. – Jon Sep 22 '18 at 00:17
  • @DYZ Sorry, but not necessarily. How did you time this? It depends a lot on how many duplicates you have, the length of the list, etc. ;) – rafaelc Sep 22 '18 at 00:17
  • @Jon It's not about what you believe, it's about what it is. A list comprehension always preserves the order. – DYZ Sep 22 '18 at 00:17
  • @DYZ Ok, but when I run my solution (in the question), I get the same first 10 entries (did not check more) as I do with this answer, but your answer gives me a different first 10 entries in the output. EDIT: I didn't see your edit, now your solution is also great! – Jon Sep 22 '18 at 00:19
  • I used ipython's %timeit: `%timeit nodups = {k for k,cnt in Counter(map(tuple, l)).items() if cnt == 1};[list(k) for k in map(tuple, l) if k in nodups]` -> 12.5mks. `%timeit s=pd.Series(list(map(tuple, l)));s[~s.duplicated(keep=False)].tolist()` -> 386mks. And yes, pd.Series takes ridiculous amount of time. – DYZ Sep 22 '18 at 00:29
  • @DYZ see my edit for time comparisons...I find pandas to be 2x faster in my tests! – Jon Sep 22 '18 at 00:34

Consider using a Counter:

from collections import Counter

First, convert your lists into tuples, because tuples are hashable (lists are not). Then count the tuples and select only those that occur exactly once; the result is a set of the non-duplicates:

nodups = {k for k,cnt in Counter(map(tuple, a)).items() if cnt == 1}

Now, since the order is important, filter the original list against non-dups:

[list(k) for k in map(tuple, a) if k in nodups]
#[[5, 6], [8, 9]]
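Put together on the question's toy list, this is a minimal runnable version:

from collections import Counter

a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
nodups = {k for k, cnt in Counter(map(tuple, a)).items() if cnt == 1}
b = [list(k) for k in map(tuple, a) if k in nodups]
print(b)  # [[5, 6], [8, 9]]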
DYZ
  • bravo how did I miss map here – vash_the_stampede Sep 22 '18 at 00:08
  • Does this preserve ordering? Testing against my method suggests that it does not, but I may have made a mistake. It does result in the same length of output, so that's good... – Jon Sep 22 '18 at 00:11
  • I tested against your data, and it does preserve the order (perhaps you saw an earlier version of the answer). – DYZ Sep 22 '18 at 00:13
  • I can confirm seeing the correct output for the data set you provided in the question. – dmulter Sep 22 '18 at 00:25
  • @DYZ I did see an earlier version of your post, and your edit works flawlessly. – Jon Sep 22 '18 at 00:26
  • `pandas` works better for big data, not for a list with 10 entries ;) – rafaelc Sep 22 '18 at 00:38
  • @RafaelC Ok, for an array of 10,000 entries, Pandas is 20% faster. You win. – DYZ Sep 22 '18 at 00:40
  • @DYZ 100,000 entries, actually :P Still, I learned from your answer (although I'm still having trouble digesting it), so thanks! – Jon Sep 22 '18 at 00:41

In Python 3.6+ dictionaries keep their insertion order, so DYZ's Counter solution can be much improved by relying on that:

[list(k) for k, c in Counter(map(tuple, a)).items() if c == 1]

On my computer it is faster than the pandas solution.

RafaelC's pandas solution can also be sped up a great deal. The key is to switch from Series to DataFrame:

import pandas as pd

s = pd.DataFrame(a)
s[~s.duplicated(keep=False)].values.tolist()

On my computer it is nearly twice as fast as the original pandas solution. The speedup comes from avoiding the prep work done outside of pandas (list(map(tuple, l))).

Steven Rumbalski

I have a solution that is efficient and uses only built-in modules:

import itertools
xy = [[104.44464000013596, 21.900339999891116],
 [9.574480000151937, 0.32839999976022227],
 [9.932610000251373, 0.19092000005798582],
 [9.821009999711748, 0.26556000039374794],
 [9.877130000349268, -0.6701499997226392],
 [149.51198999973872, -28.469329999879562],
 [149.35872999988965, -28.684280000021943],
 [9.859010000211413, -0.03293000041912819],
 [9.38918000035676, -0.9979400000309511],
 [77.35380000007001, 32.926530000359264]]

xy.sort() # sort the data in place (note: this changes the original order)
sorted_data = list(xy for xy, _ in itertools.groupby(xy)) # collapse consecutive duplicate rows, keeping one copy of each

Note: I have tested two methods, using numpy and itertools. On data of length 10,000,000, numpy took 13 seconds and itertools took 1 second.

tbhaxor
  • I am assuming that the initial sort does not preserve order, which is important... – Jon Sep 22 '18 at 00:27
  • This does not produce the correct output for the test data. – dmulter Sep 22 '18 at 00:27
  • The limitation of my approach is that you will lose the order of the data, so I would ask you to use it carefully. PS: my mistake, I forgot to mention that. – tbhaxor Sep 22 '18 at 14:17