
I have a list of numpy arrays. How can I remove duplicate arrays from the list?

I tried set(arrays) but got the error "TypeError: unhashable type: 'numpy.ndarray'".

Example with 2d arrays (mine are actually 3d). Here the starting list is length 10. The output list of distinct arrays should be length 8, because the elements at indexes 0, 5, 9 are all equal.

>>> import numpy
>>> numpy.random.seed(0)
>>> arrays = [numpy.random.randint(2, size=(2,2)) for i in range(10)]
>>> numpy.array_equal(arrays[0], arrays[5])
True
>>> numpy.array_equal(arrays[5], arrays[9])
True
Colonel Panic

3 Answers


In the end, I looped over the list, comparing each array against the already-collected ones with numpy.array_equal:

distinct = list()
for M in arrays:
    if any(numpy.array_equal(M, N) for N in distinct):
        continue
    distinct.append(M)

It's O(n**2) but what the hey.
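A self-contained run of the loop above (not in the original answer; it just reproduces the question's seed-0 example) confirms that 8 distinct arrays remain:

```python
import numpy

# Reproduce the question's setup: ten random 2x2 binary arrays.
numpy.random.seed(0)
arrays = [numpy.random.randint(2, size=(2, 2)) for i in range(10)]

# O(n**2) scan: keep an array only if nothing equal is already kept.
distinct = list()
for M in arrays:
    if any(numpy.array_equal(M, N) for N in distinct):
        continue
    distinct.append(M)

print(len(distinct))  # 8, as the question expects
```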

Colonel Panic

You can start off by collecting all those arrays from the input list into a single 2D NumPy array, one flattened array per row. Then, lex-sort it, which brings all duplicate rows into consecutive positions. Then, differentiate along the rows: duplicate rows produce all-zero differences, so prepending a True to ~(numpy.diff(sortedA,axis=0)==0).all(1) gives a mask marking the first occurrence of each distinct row. Use that mask to select rows from the sorted array, reshape them back to the original array shape, and split along the first axis to get back a list of arrays. Note that the result comes out in lexicographic order rather than input order. Thus, you would have a vectorized implementation, like so -

A = numpy.concatenate(arrays).reshape(-1, arrays[0].size)

sortedA = A[numpy.lexsort(A.T)]

idx = numpy.append(True, ~(numpy.diff(sortedA, axis=0) == 0).all(1))

out = numpy.vsplit(sortedA[idx].reshape((-1,) + arrays[0].shape), idx.sum())

Sample input, output -

In [238]: arrays
Out[238]: 
[array([[0, 1],
        [1, 0]]), array([[1, 1],
        [1, 1]]), array([[1, 1],
        [1, 0]]), array([[0, 1],
        [0, 0]]), array([[0, 0],
        [0, 1]]), array([[0, 1],
        [1, 0]]), array([[0, 1],
        [1, 1]]), array([[1, 0],
        [1, 0]]), array([[1, 0],
        [1, 1]]), array([[0, 1],
        [1, 0]])]

In [239]: out
Out[239]: 
[array([[[0, 1],
         [0, 0]]]), array([[[1, 0],
         [1, 0]]]), array([[[0, 1],
         [1, 0]]]), array([[[1, 1],
         [1, 0]]]), array([[[0, 0],
         [0, 1]]]), array([[[1, 0],
         [1, 1]]]), array([[[0, 1],
         [1, 1]]]), array([[[1, 1],
         [1, 1]]])]
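As a cross-check (not part of the original answer): since NumPy 1.13, numpy.unique accepts an axis argument, which does the stack/sort/deduplicate steps above in one call:

```python
import numpy

# Same seed-0 example as in the question.
numpy.random.seed(0)
arrays = [numpy.random.randint(2, size=(2, 2)) for i in range(10)]

# Stack into one (10, 2, 2) array and deduplicate along the first axis.
# Like the lexsort approach, the result comes back in sorted order.
uniq = numpy.unique(numpy.stack(arrays), axis=0)
out = list(uniq)  # back to a list of 2x2 arrays

print(len(out))  # 8
```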
Divakar

You can use tobytes and frombuffer (tostring and fromstring on older NumPy versions, where they are now deprecated) to convert to and from hashable items (byte strings). Together with the dtype and shape, these can go in a set:

>>> import numpy as np
>>> arrs = [np.random.random(10) for _ in range(10)]
>>> arrs += arrs  # create duplicate items
>>> 
>>> darrs = set((arr.tobytes(), arr.dtype, arr.shape) for arr in arrs)
>>> uniq_arrs = [np.frombuffer(buf, dtype=dtype).reshape(shape)
...              for buf, dtype, shape in darrs]

Storing the shape as well means this round-trips your 3-d arrays, not just 1-d ones. Note that frombuffer returns read-only arrays; call .copy() on them if you need writable results.
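A variant of the same idea (my sketch, not from the original answer) keeps the original array objects and their input order by using the byte string only as a set key:

```python
import numpy as np

np.random.seed(0)
arrays = [np.random.randint(2, size=(2, 2)) for _ in range(10)]

seen = set()
distinct = []
for a in arrays:
    # dtype and shape disambiguate arrays whose raw bytes happen to match.
    key = (a.tobytes(), a.dtype, a.shape)
    if key not in seen:
        seen.add(key)
        distinct.append(a)

print(len(distinct))  # 8
```

This is O(n) in set lookups rather than O(n**2) pairwise comparisons, and it avoids the frombuffer round-trip entirely.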
TheBlackCat