Fastest way to count identical sub-arrays in a nd-array?

Question

Let's consider a 2d-array A

2   3   5   7
2   3   5   7
1   7   1   4
5   8   6   0
2   3   5   7

The first, second and last lines are identical. The algorithm I'm looking for should return the number of identical rows for each different row (=number of duplicates of each element). If the script can be easily modified to also count the number of identical column also, it would be great.

I use an inefficient naive algorithm to do that:

import numpy
A=numpy.array([[2,  3,  5,  7],[2,  3,  5,  7],[1,  7,  1,  4],[5,  8,  6,  0],[2,  3,  5,  7]])
i=0
end = len(A)
while i<end:
    print i,
    j=i+1
    numberID = 1
    while j<end:
        print j
        if numpy.array_equal(A[i,:] ,A[j,:]):
            numberID+=1
        j+=1
    i+=1
print A, len(A)

Expected result:

array([3,1,1]) # number identical arrays per line

My algo looks like using native python within numpy, thus inefficient. Thanks for help.

considering the not-that-easy answer of Jaime (hope for more detailed explanations if someone has time to?), it would be pedagogic for novice like me to not consider this question as duplicate ;-) — sol, Oct 16 '14 at 09:21

score 3 · Accepted Answer · edited May 23 '17 at 12:20

3

In unumpy >= 1.9.0, np.unique has a return_counts keyword argument you can combine with the solution here to get the counts:

b = np.ascontiguousarray(A).view(np.dtype((np.void, A.dtype.itemsize * A.shape[1])))
unq_a, unq_cnt = np.unique(b, return_counts=True)
unq_a = unq_a.view(A.dtype).reshape(-1, A.shape[1])

>>> unq_a
array([[1, 7, 1, 4],
       [2, 3, 5, 7],
       [5, 8, 6, 0]])

>>> unq_cnt
array([1, 3, 1])

In an older numpy, you can replicate what np.unique does, which would look something like:

a_view = np.array(A, copy=True)
a_view = a_view.view(np.dtype((np.void,
                               a_view.dtype.itemsize*a_view.shape[1]))).ravel()
a_view.sort()
a_flag = np.concatenate(([True], a_view[1:] != a_view[:-1]))
a_unq = A[a_flag]
a_idx = np.concatenate(np.nonzero(a_flag) + ([a_view.size],))
a_cnt = np.diff(a_idx)

>>> a_unq
array([[1, 7, 1, 4],
       [2, 3, 5, 7],
       [5, 8, 6, 0]])

>>> a_cnt
array([1, 3, 1])

edited May 23 '17 at 12:20

Community

1
1

answered Oct 15 '14 at 20:48

Jaime

65,696
17
124
159

1

thank you, really hard to understand the algo and logic considering the one-line formalization, but I will work on untill death to understand ;-) – sol Oct 16 '14 at 09:05
can you explain what you do and why to get b before applying numpy.unique() please? – sol Oct 21 '14 at 14:02
1

It is explained in the comments of the question linked. Basically, it creates a view of the data of type `np.void` and size the number of bytes in a full row. Hence, every row appears to numpy as one single item. – Jaime Oct 21 '14 at 15:44
ok, some tests and handlings help me to understand. TY again Jaime! – sol Oct 22 '14 at 09:07

score 2 · Answer 2 · answered Oct 15 '14 at 16:37

You can lexsort on the row entries, which will give you the indices for traversing the rows in sorted order, making the search O(n) rather than O(n^2). Note that by default, the elements in the last column sort last, i.e. the rows are 'alphabetized' right to left rather than left to right.

In [9]: a
Out[9]: 
array([[2, 3, 5, 7],
       [2, 3, 5, 7],
       [1, 7, 1, 4],
       [5, 8, 6, 0],
       [2, 3, 5, 7]])

In [10]: lexsort(a.T)
Out[10]: array([3, 2, 0, 1, 4])

In [11]: a[lexsort(a.T)]
Out[11]: 
array([[5, 8, 6, 0],
       [1, 7, 1, 4],
       [2, 3, 5, 7],
       [2, 3, 5, 7],
       [2, 3, 5, 7]])

your proposal sorts sub-arrays but doesn't count them ;-) – sol Oct 16 '14 at 09:18 — sol, Oct 16 '14 at 09:18

score 1 · Answer 3 · answered Oct 15 '14 at 14:30

You can use Counter class from collections module for this.

It works like this :

x = [2, 2, 1, 5, 2]
from collections import Counter
c=Counter(x)
print c

Output : Counter({2: 3, 1: 1, 5: 1})

Only issue you will face is in your case since every value of x is itself a list which is a non hashable data structure. If you can convert every value of x in a tuple that it should works as :

x = [(2,  3,  5,  7),(2,  3,  5,  7),(1,  7,  1,  4),(5,  8,  6,  0),(2,  3,  5,  7)]
from collections import Counter
c=Counter(x)
print c

Output : Counter({(2, 3, 5, 7): 3, (5, 8, 6, 0): 1, (1, 7, 1, 4): 1})

Counter doesn't work on numpy object (a pity) so we need to convert into lists. Seems for me that converting numpy nd-array to nd-list would be time consuming but thank for the pythonic rather than numpic method. — sol, Oct 16 '14 at 09:17

Fastest way to count identical sub-arrays in a nd-array?

3 Answers3

Linked

Related