Consider the array and function definition shown:
import numpy as np
a = np.array([[2, 2, 5, 6, 2, 5],
              [1, 5, 8, 9, 9, 1],
              [0, 4, 2, 3, 7, 9],
              [1, 4, 1, 1, 5, 1],
              [6, 5, 4, 3, 2, 1],
              [3, 6, 3, 6, 3, 6],
              [0, 2, 7, 6, 3, 4],
              [3, 3, 7, 7, 3, 3]])

def grpCountSize(arr, grpCount, grpSize):
    count = [np.unique(row, return_counts=True) for row in arr]
    valid = [np.any(np.count_nonzero(row[1] == grpSize) == grpCount) for row in count]
    return valid
The point of the function is to build a boolean mask that selects the rows of array a that have exactly grpCount groups of elements, where each group holds exactly grpSize identical elements.
For example:
# which rows have exactly 1 group that holds exactly 2 identical elements?
out = a[grpCountSize(a, 1, 2)]
As expected, the code outputs out = [[2, 2, 5, 6, 2, 5], [3, 3, 7, 7, 3, 3]].
The 1st output row has exactly 1 group of 2 (i.e. 5,5), while the 2nd output row also has exactly 1 group of 2 (i.e. 7,7).
Similarly:
# which rows have exactly 2 groups that each hold exactly 3 identical elements?
out = a[grpCountSize(a, 2, 3)]
This produces out = [[3, 6, 3, 6, 3, 6]], because only this row has exactly 2 groups each holding exactly 3 elements (i.e. 3,3,3 and 6,6,6).
PROBLEM: My actual arrays have just 6 columns, but they can have many millions of rows. The code gives the correct result, but it is VERY SLOW for long arrays, presumably because it calls np.unique once per row in a Python loop. Is there a way to speed this up?
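One direction that might help, offered as an unbenchmarked sketch rather than a definitive answer: since there are only 6 columns, every element of a row can be compared against every other element of the same row with broadcasting, which removes the Python-level loop entirely. The function name grp_count_size_vectorized is made up for illustration, and the sketch assumes grpSize is at least 1. It relies on the fact that each group of size grpSize contributes exactly grpSize elements whose per-row count equals grpSize.

```python
import numpy as np

a = np.array([[2, 2, 5, 6, 2, 5],
              [1, 5, 8, 9, 9, 1],
              [0, 4, 2, 3, 7, 9],
              [1, 4, 1, 1, 5, 1],
              [6, 5, 4, 3, 2, 1],
              [3, 6, 3, 6, 3, 6],
              [0, 2, 7, 6, 3, 4],
              [3, 3, 7, 7, 3, 3]])

def grp_count_size_vectorized(arr, grp_count, grp_size):
    # Hypothetical helper; assumes arr is 2-D and grp_size >= 1.
    # per_elem[i, j] = number of entries in row i equal to arr[i, j]
    per_elem = (arr[:, :, None] == arr[:, None, :]).sum(axis=2)
    # Number of entries in each row that belong to a group of size grp_size
    in_groups = (per_elem == grp_size).sum(axis=1)
    # grp_count groups of that size account for grp_count * grp_size entries
    return in_groups == grp_count * grp_size

print(a[grp_count_size_vectorized(a, 1, 2)])  # rows [2 2 5 6 2 5] and [3 3 7 7 3 3]
print(a[grp_count_size_vectorized(a, 2, 3)])  # row [3 6 3 6 3 6]
```

The intermediate comparison array has shape (rows, 6, 6), so for many millions of rows it may be worth processing the input in chunks to keep memory use bounded.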