
I have a list of numpy arrays. How can I remove duplicate arrays from the list?

I tried set(arrays) but got the error "TypeError: unhashable type: 'numpy.ndarray'".

Example with 2d arrays (mine are actually 3d). Here the starting list is length 10. The output list of distinct arrays should be length 8, because the elements at indexes 0, 5, 9 are all equal.

>>> import numpy
>>> numpy.random.seed(0)
>>> arrays = [numpy.random.randint(2, size=(2,2)) for i in range(10)]
>>> numpy.array_equal(arrays[0], arrays[5])
True
>>> numpy.array_equal(arrays[5], arrays[9])
True
Colonel Panic

3 Answers


In the end, I looped over the list, comparing each array against the already-collected ones with numpy.array_equal:

distinct = list()
for M in arrays:
    if any(numpy.array_equal(M, N) for N in distinct):
        continue
    distinct.append(M)

It's O(n**2) but what the hey.
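A self-contained run of the loop above (not in the original answer; it just reproduces the question's seed-0 example) confirms that 8 distinct arrays remain:

```python
import numpy

# Reproduce the question's setup: ten random 2x2 binary arrays.
numpy.random.seed(0)
arrays = [numpy.random.randint(2, size=(2, 2)) for i in range(10)]

# O(n**2) scan: keep an array only if nothing equal is already kept.
distinct = list()
for M in arrays:
    if any(numpy.array_equal(M, N) for N in distinct):
        continue
    distinct.append(M)

print(len(distinct))  # 8, as the question expects
```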

Colonel Panic

You can start off by collecting all those arrays from the input list into a single 2D NumPy array, one flattened array per row. Then, lex-sort it, which brings all duplicate rows into consecutive positions. Then, differentiate along the rows: duplicate rows produce all-zero differences, so prepending a True to ~(numpy.diff(sortedA,axis=0)==0).all(1) gives a mask marking the first occurrence of each distinct row. Use that mask to select rows from the sorted array, reshape them back to the original array shape, and split along the first axis to get back a list of arrays. Note that the result comes out in lexicographic order rather than input order. Thus, you would have a vectorized implementation, like so -

A = numpy.concatenate(arrays).reshape(-1, arrays[0].size)

sortedA = A[numpy.lexsort(A.T)]

idx = numpy.append(True, ~(numpy.diff(sortedA, axis=0) == 0).all(1))

out = numpy.vsplit(sortedA[idx].reshape((-1,) + arrays[0].shape), idx.sum())

Sample input, output -

In [238]: arrays
Out[238]: 
[array([[0, 1],
        [1, 0]]), array([[1, 1],
        [1, 1]]), array([[1, 1],
        [1, 0]]), array([[0, 1],
        [0, 0]]), array([[0, 0],
        [0, 1]]), array([[0, 1],
        [1, 0]]), array([[0, 1],
        [1, 1]]), array([[1, 0],
        [1, 0]]), array([[1, 0],
        [1, 1]]), array([[0, 1],
        [1, 0]])]

In [239]: out
Out[239]: 
[array([[[0, 1],
         [0, 0]]]), array([[[1, 0],
         [1, 0]]]), array([[[0, 1],
         [1, 0]]]), array([[[1, 1],
         [1, 0]]]), array([[[0, 0],
         [0, 1]]]), array([[[1, 0],
         [1, 1]]]), array([[[0, 1],
         [1, 1]]]), array([[[1, 1],
         [1, 1]]])]
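As a cross-check (not part of the original answer): since NumPy 1.13, numpy.unique accepts an axis argument, which does the stack/sort/deduplicate steps above in one call:

```python
import numpy

# Same seed-0 example as in the question.
numpy.random.seed(0)
arrays = [numpy.random.randint(2, size=(2, 2)) for i in range(10)]

# Stack into one (10, 2, 2) array and deduplicate along the first axis.
# Like the lexsort approach, the result comes back in sorted order.
uniq = numpy.unique(numpy.stack(arrays), axis=0)
out = list(uniq)  # back to a list of 2x2 arrays

print(len(out))  # 8
```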
Divakar

You can use tobytes and frombuffer (tostring and fromstring on older NumPy versions, where they are now deprecated) to convert to and from hashable items (byte strings). Together with the dtype and shape, these can go in a set:

>>> import numpy as np
>>> arrs = [np.random.random(10) for _ in range(10)]
>>> arrs += arrs  # create duplicate items
>>> 
>>> darrs = set((arr.tobytes(), arr.dtype, arr.shape) for arr in arrs)
>>> uniq_arrs = [np.frombuffer(buf, dtype=dtype).reshape(shape)
...              for buf, dtype, shape in darrs]

Storing the shape as well means this round-trips your 3-d arrays, not just 1-d ones. Note that frombuffer returns read-only arrays; call .copy() on them if you need writable results.
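A variant of the same idea (my sketch, not from the original answer) keeps the original array objects and their input order by using the byte string only as a set key:

```python
import numpy as np

np.random.seed(0)
arrays = [np.random.randint(2, size=(2, 2)) for _ in range(10)]

seen = set()
distinct = []
for a in arrays:
    # dtype and shape disambiguate arrays whose raw bytes happen to match.
    key = (a.tobytes(), a.dtype, a.shape)
    if key not in seen:
        seen.add(key)
        distinct.append(a)

print(len(distinct))  # 8
```

This is O(n) in set lookups rather than O(n**2) pairwise comparisons, and it avoids the frombuffer round-trip entirely.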
TheBlackCat