I have a list of xy coordinates as lists:
print(xy[0:10])
[[104.44464000013596, 21.900339999891116],
[9.574480000151937, 0.32839999976022227],
[9.932610000251373, 0.19092000005798582],
[9.821009999711748, 0.26556000039374794],
[9.877130000349268, -0.6701499997226392],
[149.51198999973872, -28.469329999879562],
[149.35872999988965, -28.684280000021943],
[9.859010000211413, -0.03293000041912819],
[9.38918000035676, -0.9979400000309511],
[77.35380000007001, 32.926530000359264]]
Shown here are the first 10, but I have ~100,000 coordinate pairs in my list.
I would like to remove every coordinate pair that appears more than once in this list (not merely keep one copy of each), and to do so efficiently. As an easier-to-follow example, I would like to create a function remove_dupes
that produces the following result:
a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
b = remove_dupes(a)
print(b)
[[5, 6], [8, 9]]
Please note that preserving the original order is important.
However, because I have such a large list, I find using the .count() method and iterating through the list to be too time-consuming. I also tried various tricks with set() and numpy's unique function.
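For concreteness, the .count()-based version I mean is roughly this (a minimal sketch, not my exact code); it rescans the whole list for every element, so it is quadratic and hopeless at this size:
def remove_dupes_slow(xy):
    # Keep only the pairs that appear exactly once; list.count() walks the
    # entire list for every element, so this is O(n^2).
    return [p for p in xy if xy.count(p) == 1]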
Here is the fastest version I could come up with:
from collections import Counter
import numpy as np

# xy = [[x1, y1], [x2, y2], ..., [xn, yn]]
def remove_dupes(xy):
    xy = [tuple(p) for p in xy]             # Convert each pair to a tuple so it is hashable
    p_hash = [hash(p) for p in xy]          # Hash each coordinate pair to a single value
    counts = Counter(p_hash)                # Counts (dictionary) for each unique hash value
    p_remove = [key for (key, value) in counts.items() if value > 1]  # Keep the hashes with count > 1
    p_hash = np.array(p_hash)               # Cast from list to NumPy array
    remove = np.zeros(len(xy), dtype=bool)  # Initialize storage (one flag per pair)
    for p in p_remove:                      # Mark every index where a non-unique hash appears // most time-consuming portion
        remove[np.where(p == p_hash)[0]] = True
    xy = np.array(xy)                       # Cast back to a NumPy array for indexing
    xy = xy[~remove, :]                     # Keep only the pairs that appear exactly once
    return xy
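As a quick sanity check, running this on the toy example above keeps only the rows that appear exactly once and returns them as a NumPy array:
a = [[1, 2], [3, 4], [5, 6], [1, 2], [1, 2], [8, 9], [3, 4]]
print(remove_dupes(a))
# [[5 6]
#  [8 9]]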
This takes ~2 seconds for ~100,000 values (and takes longer when there are more duplicate pairs, triples, etc.). What bothers me is that functions like numpy.unique() return counts and indices in fractions of a second, yet I can't figure out how to conform their outputs to this problem. I looked through a couple dozen similar Stack Exchange posts, but found nothing that was both efficient and elegant. Does anyone have a suggestion for a more elegant way to solve this than what I've presented here?
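For reference, this is the kind of numpy.unique() call I was experimenting with (a sketch, not my exact attempt); it returns the unique rows, the index of each unique row's first occurrence, and the per-row counts almost instantly, but the unique rows come back sorted, and I couldn't see how to turn these outputs into an order-preserving filter:
arr = np.array(xy)
# uniq: sorted unique rows; first_idx: index of each unique row's first
# occurrence in arr; cnt: how many times each unique row occurs.
uniq, first_idx, cnt = np.unique(arr, axis=0, return_index=True, return_counts=True)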
EDIT:
I've received two answers that produce the correct result (and preserve order): RafaelC provided a Pandas option, and DYZ provided a Counter option. I am not very familiar with how to time things properly, but I ran both for 100 iterations (on the same data) using time.time(), with the following results:
Pandas: 13.02 sec
Counter: 28.15 sec
I am not sure why the Pandas implementation is faster; one difference is that the Pandas solution returns tuples (which is OK), so I tried the Counter solution without conversion back to lists and it was still 25 seconds.
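For what it's worth, the timing was done roughly like this (a sketch of the time.time() loop I described; time_solution and the per-answer functions passed to it are just placeholders, not the answers' actual names):
import time

def time_solution(func, data, iterations=100):
    # Crude wall-clock timing: run the candidate function repeatedly on the
    # same data and report the total elapsed time.
    start = time.time()
    for _ in range(iterations):
        func(data)
    return time.time() - start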