
I have an array of shape (128, 36, 8) and I'd like to find the number of occurrences of the unique subarrays of length 8 in the last dimension.

I'm aware of np.unique and np.bincount, but those seem to be for elements rather than subarrays. I've seen this question but it's about finding the first occurrence of a particular subarray, rather than the counts of all unique subarrays.
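For reference, a minimal sketch assuming a newer NumPy (1.13+, where np.unique gained an `axis` argument; `return_counts` needs 1.9+), with an illustrative random array standing in for the real data:

import numpy as np

A = np.random.randint(0, 2, (128, 36, 8))  # stand-in for the real data

# Treat each length-8 subarray as a row, then count unique rows directly
rows = A.reshape(-1, A.shape[-1])
unique_rows, counts = np.unique(rows, axis=0, return_counts=True)
print(unique_rows.shape, counts.sum())  # counts.sum() == 128 * 36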

Will
  • I couldn't think up a way to do it within numpy, but would a [trie](https://en.wikipedia.org/wiki/Trie) be too slow? It would only need to access each element once and then at the end you automatically have the number of unique subarrays as well as their locations if you stored them. – KobeJohn Jun 16 '15 at 23:51
  • Here is a closely related question, http://stackoverflow.com/questions/8560440/removing-duplicate-columns-and-rows-from-a-numpy-2d-array. The basic idea is you sort the subarrays (lexicographic sort). Once the similar subarrays are grouped it's trivial to identify and count them. – Bi Rico Jun 17 '15 at 00:07

3 Answers


The question states that the input array is of shape (128, 36, 8) and we are interested in finding unique subarrays of length 8 in the last dimension. So I am assuming that uniqueness is to be judged with the first two dimensions merged together, i.e. across the 128*36 rows of length 8. Let us call the input 3D array A.

Get the number of unique subarrays

# Reshape the 3D array to a 2D array merging the first two dimensions
Ar = A.reshape(-1,A.shape[2])

# Perform a lexicographic sort and reorder the rows accordingly
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx,:]

# Count the rows that differ from the previous row (each such row starts
# a new unique subarray); the +1 accounts for the very first row
unq_out = np.any(np.diff(sorted_Ar,axis=0),1).sum()+1

Sample run -

In [159]: A # A is (2,2,3)
Out[159]: 
array([[[0, 0, 0],
        [0, 0, 2]],

       [[0, 0, 2],
        [2, 0, 1]]])

In [160]: unq_out
Out[160]: 3

Get the count of occurrences of unique subarrays

# Reshape the 3D array to a 2D array merging the first two dimensions
Ar = A.reshape(-1,A.shape[2])

# Perform a lexicographic sort and reorder the rows accordingly
sorted_idx = np.lexsort(Ar.T)
sorted_Ar = Ar[sorted_idx,:]

# Assign a group ID to each sorted row; the ID increments whenever a row
# differs from the one before it
ids = np.append([0],np.any(np.diff(sorted_Ar,axis=0),1).cumsum())

# Get counts for each ID as the final output
unq_count = np.bincount(ids)

Sample run -

In [64]: A
Out[64]: 
array([[[0, 0, 2],
        [1, 1, 1]],

       [[1, 1, 1],
        [1, 2, 0]]])

In [65]: unq_count
Out[65]: array([1, 2, 1], dtype=int64)
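
To also get the unique subarrays themselves alongside the counts, here is a minimal sketch building on the variables above (`first_idx` and `unq_rows` are names introduced for illustration):

# Each run of identical rows starts right after a row that differs from
# its successor; those starting rows are the unique subarrays, in the
# same order as unq_count
first_idx = np.append([0], np.flatnonzero(np.any(np.diff(sorted_Ar, axis=0), 1)) + 1)
unq_rows = sorted_Ar[first_idx]

# zip(unq_rows, unq_count) pairs each unique subarray with its count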
Divakar
  • This is great (I hadn't thought to use `np.lexsort` and I didn't know about `np.diff`), but we're actually interested in finding the *number of occurrences* of unique subarrays, not just the number of unique subarrays. Could this method be adapted to return the unique subarrays along with their counts, as in @farhawa's answer? – Will Jun 17 '15 at 18:28
  • Awesome, thank you. BTW, my modification of your original answer seems to be slightly faster than your extension: ~668 µs vs ~685 µs. – Will Jun 18 '15 at 15:43
  • @Will Great! How about test it on a larger dataset, something like `(1000, 1000, 8)`, if possible? – Divakar Jun 19 '15 at 05:36
  • I got similar results with a larger dataset: 423ms vs 433ms. – Will Jun 25 '15 at 17:44

Here I've modified @Divakar's very useful answer to return the counts of the unique subarrays, as well as the subarrays themselves, so that the output is the same as that of collections.Counter.most_common():

import numpy as np

def most_common_subarrays(arr):
    # Get the array in 2D form, one length-8 subarray per row.
    arr = arr.reshape(-1, arr.shape[-1])

    # Lexicographically sort the rows so duplicates end up adjacent
    sorted_arr = arr[np.lexsort(arr.T), :]

    # Get the indices where a new row appears
    diff_idx = np.where(np.any(np.diff(sorted_arr, axis=0), 1))[0]

    # Get the unique rows (the last row of each run, plus the final row)
    unique_rows = [sorted_arr[i] for i in diff_idx] + [sorted_arr[-1]]

    # Get the number of occurrences of each unique array (the -1 is needed at
    # the beginning, rather than 0, because of fencepost concerns)
    counts = np.diff(
        np.append(np.insert(diff_idx, 0, -1), sorted_arr.shape[0] - 1))

    # Return the (row, count) pairs sorted by count
    return sorted(zip(unique_rows, counts), key=lambda x: x[1], reverse=True)
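
A quick usage sketch (the random test array is illustrative):

A = np.random.randint(0, 2, (128, 36, 8))
for row, count in most_common_subarrays(A)[:3]:
    print(row, count)  # most frequent subarrays first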
Will

I am not sure that it's the most efficient way to do it, but this should work.

arr = arr.reshape(128*36, 8)
unique_ = []
occurrence_ = []

for sub in arr:
    if sub.tolist() not in unique_:
        unique_.append(sub.tolist())
        occurrence_.append(1)
    else:
        occurrence_[unique_.index(sub.tolist())] += 1

for index_, u in enumerate(unique_):
    print("%s occurrence: %s" % (u, occurrence_[index_]))
farhawa
  • This will work but I was looking to avoid functions that use native Python such as `tolist` and `index`, which are expensive. Thanks for the answer though. – Will Jun 16 '15 at 23:27
  • An obvious optimization to your method, by the way, would be to keep the counts in a dictionary where the keys are tuples of the subarrays, rather than in a list that we need to keep searching through with `unique_.index`. – Will Jun 16 '15 at 23:29
  • @Will or even better, use `collections.Counter`, `counts = Counter(tuple(row) for row in arr)` :) – Bi Rico Jun 17 '15 at 02:47
  • @BiRico, that's great, didn't know about that builtin! – Will Jun 17 '15 at 16:49
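
Expanding Bi Rico's `collections.Counter` suggestion into a runnable sketch (the random test array is illustrative):

import numpy as np
from collections import Counter

A = np.random.randint(0, 2, (128, 36, 8))

# Tuples are hashable, so each length-8 subarray can serve as a Counter key
counts = Counter(tuple(row) for row in A.reshape(-1, A.shape[-1]))
for subarray, count in counts.most_common(3):
    print(subarray, count)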