2D Vectorization of unique values per row with condition

Question

Consider the array and function definition shown:

import numpy as np

a = np.array([[2, 2, 5, 6, 2, 5],
              [1, 5, 8, 9, 9, 1],
              [0, 4, 2, 3, 7, 9],
              [1, 4, 1, 1, 5, 1],
              [6, 5, 4, 3, 2, 1],
              [3, 6, 3, 6, 3, 6],
              [0, 2, 7, 6, 3, 4],
              [3, 3, 7, 7, 3, 3]])

def grpCountSize(arr, grpCount, grpSize):    
    count = [np.unique(row, return_counts=True)  for row in arr]
    valid = [np.any(np.count_nonzero(row[1] == grpSize) == grpCount) for row in count]
    return valid

The point of the function is to return the rows of array a that have exactly grpCount groups of elements that each hold exactly grpSize identical elements.

For example:

# which rows have exactly 1 group that holds exactly 2 identical elements?
out = a[grpCountSize(a, 1, 2)]

As expected, the code outputs out = [[2, 2, 5, 6, 2, 5], [3, 3, 7, 7, 3, 3]]. The 1st output row has exactly 1 group of 2 (ie: 5,5), while the 2nd output row also has exactly 1 group of 2 (ie: 7,7).

Similarly:

# which rows have exactly 2 groups that each hold exactly 3 identical elements?
out = a[grpCountSize(a, 2, 3)]

This produces out = [[3, 6, 3, 6, 3, 6]], because only this row has exactly 2 groups each holding exactly 3 elements (ie: 3,3,3 and 6,6,6)

PROBLEM: My actual arrays have just 6 columns, but they can have many millions of rows. The code works perfectly as intended, but it is VERY SLOW for long arrays. Is there a way to speed this up?

The big speedup in `numpy` comes from applying compiled methods to the whole array. Your code has to work row by row; if you can't figure out a way of working with multiple rows at one, you stuck with trying to speed up that row-by-row operation. I suspect you can apply the `unique` and `count` operations together to one row, but have no sense whether than will improve times compared to the current 2 loops. The other option is use `numba`, but I won't explore that. — hpaulj, Feb 04 '21 at 02:16

Ehsan · Accepted Answer · 2021-02-04T03:02:09.253

np.unique sorts the array which makes it less efficient for your purpose. Use np.bincount and that way you most likely will save some time(depending on your array shape and values in the array). You also will not need np.any anymore:

def grpCountSize(arr, grpCount, grpSize):    
    count = [np.bincount(row) for row in arr]
    valid = [np.count_nonzero(row == grpSize) == grpCount for row in count]
    return valid

Another way that might even save more time is using same number of bins for all rows and create one array:

def grpCountSize(arr, grpCount, grpSize):
    m = arr.max()
    count = np.stack([np.bincount(row, minlength=m+1) for row in arr])
    return (count == grpSize).sum(1)==grpCount

Another yet upgrade is to use vectorized 2D bin count from this post. For example (note that Numba solutions tested in the post above is faster. I just provided the numpy solution for example. You can replace the function with any of the suggested ones in the post linked above):

def grpCountSize(arr, grpCount, grpSize):
    count = bincount2D_vectorized(arr)
    return (count == grpSize).sum(1)==grpCount
#from the post above
def bincount2D_vectorized(a):    
    N = a.max()+1
    a_offs = a + np.arange(a.shape[0])[:,None]*N
    return np.bincount(a_offs.ravel(), minlength=a.shape[0]*N).reshape(-1,N)

output of all solutions above:

a[grpCountSize2(a, 1, 2)]
#array([[2, 2, 5, 6, 2, 5],
#       [3, 3, 7, 7, 3, 3]])

If you just want to use Numba without getting into its details, simply replace `bincount2D_vectorized` with any Numba functions in the link provided. — Ehsan, Feb 04 '21 at 06:56

2D Vectorization of unique values per row with condition

1 Answers1