8

I have an ordinary Python list that contains (multidimensional) NumPy arrays, all of the same shape and with the same number of values. Some of the arrays in the list are duplicates of earlier ones.

I want to remove all the duplicates, but the fact that the data type is NumPy arrays complicates this a bit:

• I can't use set(), as NumPy arrays are not hashable.
• I can't check for duplicates during insertion, because the arrays are generated in batches by a function and added to the list with .extend().
• NumPy arrays aren't directly comparable without resorting to one of NumPy's own functions, so I can't just write something that uses "if x in list"...
• The contents of the list need to remain NumPy arrays at the end of the process; I could compare copies of the arrays converted to nested lists, but I can't convert the arrays to plain Python lists permanently.
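
For concreteness, a minimal sketch of the two failure modes (the values here are made up):

import numpy as np

batch = [np.array([[1, 2], [3, 4]]), np.array([[1, 2], [3, 4]])]

# Arrays are unhashable:
set(batch)  # TypeError: unhashable type: 'numpy.ndarray'

# Membership tests compare with ==, which is element-wise for arrays:
np.array([[1, 2], [3, 4]]) in batch  # ValueError: truth value is ambiguous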

Any suggestions on how I can remove duplicates efficiently here?

SoItBegins

3 Answers

7

Using the solutions from Most efficient property to hash for numpy array, we see that hashing works best with a.tostring() if a is a NumPy array. So:

import numpy as np
arraylist = [np.array([1,2,3,4]), np.array([1,2,3,4]), np.array([1,3,2,4])]
# Keying the dict on the raw bytes makes the arrays hashable; later
# duplicates simply overwrite earlier ones.
L = {array.tostring(): array for array in arraylist}
L.values() # [array([1, 3, 2, 4]), array([1, 2, 3, 4])]
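
Note that tostring() ignores shape, so two arrays with the same flat data but different shapes would collide (not an issue here, since the question says all arrays share one shape). In current NumPy, tobytes() is the preferred name for tostring(). A hedged variant that also keys on the shape:

# Sketch: include the shape in the key so that, e.g., a (2, 3) array and
# a (3, 2) array containing the same bytes don't collapse into one entry.
L = {(array.shape, array.tobytes()): array for array in arraylist}
unique_arrays = list(L.values())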
Joel
  • Ooh, that looks good. I can have the key just be the array.tolist(). Thank you! – SoItBegins Jan 03 '15 at 02:28
  • So lists aren't hashable either but I think I can find something. – SoItBegins Jan 03 '15 at 02:41
  • @SoItBegins `list` and `ndarray` are *mutable* -- if you tried to define a hash value of a `list` it would need to work based on the hash values of the contents of that `list` -- but if the contents can change because it is mutable, then the hash value could suddenly be wrong for the contents of the array. – ely Jan 03 '15 at 02:41
  • You're leaving out the detail that you need to set the `writeable` flag to `False` -- which is a big deal if you want to mutate these arrays (which I assume is an important point of using them). – ely Jan 03 '15 at 02:54
  • @prpl.mnky.dshwshr after all is said + done, I've cleaned this up quite a bit and introduced a dict comprehension. Let me know if there are any more comments/concerns. (I'm deleting extraneous comments to clean this up as well) – Joel Jan 03 '15 at 06:05
  • My only further comment, which links up with what I said at the beginning, is that this isn't safe against mutability unless you can always guarantee that no processing happens to `L` in between its creation and the step where everything is dumped to a `list` via `L.values`. In between those steps, you can access and modify `L[a.tostring()]` so that the value for that key is different than the stringification of `a`, e.g. if `a` is 2-D: `L[a.tostring()][0,0] = np.NaN` for example. I think the one liner using `tuple` is a bit safer in this regard (no intermediate chance to mutate). – ely Jan 03 '15 at 15:53
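
To make the mutability caveat from these comments concrete, here is a small sketch (not from the original thread) of how the dict can go stale:

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])
L = {a.tostring(): a}

# The stored value is the same object as a, so mutating it through the
# dict leaves the key describing bytes the array no longer contains:
L[a.tostring()][0, 0] = np.nan
next(iter(L)) == a.tostring()  # False -- the key is now stale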
5

Depending on the structure of your data, it may be quicker to compare all the arrays directly rather than finding some way to hash them. The algorithm is O(n^2), but each individual comparison will be much quicker than creating strings or Python lists of your arrays. So it depends on how many arrays you have to check.

e.g.

import numpy

uniques = []
for arr in possible_duplicates:
    # Keep arr only if it doesn't equal any array we've already kept.
    if not any(numpy.array_equal(arr, unique_arr) for unique_arr in uniques):
        uniques.append(arr)
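
If you need this in more than one place, it drops naturally into a small helper (a sketch, not part of the original answer):

import numpy as np

def dedupe(arrays):
    # Return the arrays with exact duplicates removed, preserving order.
    uniques = []
    for arr in arrays:
        # any() short-circuits, so each candidate is compared only until
        # its first match among the arrays kept so far.
        if not any(np.array_equal(arr, u) for u in uniques):
            uniques.append(arr)
    return uniques

dedupe([np.array([1, 2]), np.array([1, 2]), np.array([2, 1])])
# [array([1, 2]), array([2, 1])]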
Dunes
2

Here is one way using tuple:

>>> import numpy as np
>>> t = [np.asarray([1, 2, 3, 4]), 
         np.asarray([1, 2, 3, 4]), 
         np.asarray([1, 1, 3, 4])]

>>> map(np.asarray, set(map(tuple, t)))  # Python 2; in Python 3, wrap in list()
[array([1, 1, 3, 4]), array([1, 2, 3, 4])]

If your arrays are multidimensional, then first flatten them to a 1-by-whatever array, then use the same idea, and reshape them at the end:

def to_tuple(arr):
    # Flatten to 1-D so every element is a hashable scalar.
    return tuple(arr.reshape((arr.size,)))

def from_tuple(tup, original_shape):
    # Rebuild the array from the flat tuple and restore its shape.
    return np.asarray(tup).reshape(original_shape)

Example:

In [64]: t = np.asarray([[[1,2,3],[4,5,6]],
                         [[1,1,3],[4,5,6]],
                         [[1,2,3],[4,5,6]]])

In [65]: map(lambda x: from_tuple(x, t[0].shape), set(map(to_tuple, t)))
Out[65]: 
[array([[1, 2, 3],
        [4, 5, 6]]), 
 array([[1, 1, 3],
        [4, 5, 6]])]

Another option is to create a pandas.DataFrame out of your ndarrays (treating them as rows by reshaping if needed) and using the pandas built-ins for uniquifying the rows.

In [34]: t
Out[34]: [array([1, 2, 3, 4]), array([1, 2, 3, 4]), array([1, 1, 3, 4])]

In [35]: pandas.DataFrame(t).drop_duplicates().values
Out[35]: 
array([[1, 2, 3, 4],
       [1, 1, 3, 4]])
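
For multidimensional arrays the same idea works if you flatten each array into a row first; a sketch, assuming (as in the question) that every array has the same shape:

import numpy as np
import pandas

t = [np.array([[1, 2], [3, 4]]),
     np.array([[1, 2], [3, 4]]),
     np.array([[1, 1], [3, 4]])]
shape = t[0].shape

# Flatten each array to a row, drop duplicate rows, restore the shape.
flat = pandas.DataFrame([arr.ravel() for arr in t]).drop_duplicates().values
uniques = [row.reshape(shape) for row in flat]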

Overall, I think it's a bad idea to try to use tostring() as a quasi hash function, because you'll need more boilerplate code than in my approach just to protect against the possibility that some of the contents are mutated after they've been assigned their "hash" key in some dict.

If the reshaping and converting to tuple is too slow given the size of the data, my feeling is that this reveals a more fundamental problem: the application isn't designed well around its needs (like de-duping), and trying to cram them into some Python process running in memory is probably not the right way. At that point, I would stop to consider whether something like Cassandra, which can easily build database indices on top of large columns (or multidimensional arrays) of floating-point or other data, isn't the saner approach.

ely
  • Hmm. I tried using np.ndarray.tolist but lists aren't hashable. I'll look into it further; my arrays are multidimensional so just using 'tuple' still causes an error on the subarrays. – SoItBegins Jan 03 '15 at 02:39
  • @SoItBegins Why are you using `tolist`? Just use the code from my example. I'm not using `tolist` above .. I'm using just plain `tuple`. – ely Jan 03 '15 at 02:42
  • I can't use tuple because the numpy arrays are multidimensional. Using tuple() creates a tuple with smaller numpy arrays inside it, which triggers the same error. – SoItBegins Jan 03 '15 at 03:02
  • You could reshape them to a long 1-by-whatever array, then use tuple, then reshape them back to their original shape at the end. – ely Jan 03 '15 at 15:46