
What is the best way to count the rows in a 2d numpy array that include all values of another 1d numpy array? The 2d array can have more columns than the length of the 1d array.

elements = np.arange(4).reshape((2, 2))
test_elements = [2, 3]
somefunction(elements, test_elements)

I would expect the function to return 1.

elements = np.arange(15).reshape((5, 3))

# array([[ 0,  1,  2],
#       [ 3,  4,  5],
#       [ 6,  7,  8],
#       [ 9, 10, 11],
#       [12, 13, 14]])

test_elements = [4, 3]
somefunction(elements, test_elements)

Should also return 1.

All elements of the 1d array must be present in a row. If only some of them are found in a row, that row doesn't count. Hence:

elements = np.arange(15).reshape((5, 3))

# array([[ 0,  1,  2],
#       [ 3,  4,  5],
#       [ 6,  7,  8],
#       [ 9, 10, 11],
#       [12, 13, 14]])

test_elements = [3, 4, 10]
somefunction(elements, test_elements)

Should return 0.

DavidJM

4 Answers


Create a boolean array of the elements found, then use any row-wise (so that multiple matches in the same row are counted only once), and finally count the rows using sum:

np.any(np.isin(elements, test), axis=1).sum()

Output

>>> elements
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11],
       [12, 13, 14]])
>>> test = [1, 6, 7, 4]
>>> np.any(np.isin(elements, test), axis=1).sum()
3
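
As the comments below point out, this counts rows containing any of the test values. A minimal sketch of a per-row check requiring all of them (an adjustment, not part of this answer's original code) could be:

import numpy as np

elements = np.arange(15).reshape((5, 3))
test = [4, 3]

# for each row, check that every value of `test` appears in it, then count the rows
count = sum(np.isin(test, row).all() for row in elements)
print(count)  # 1 -- only the row [3, 4, 5] contains both 3 and 4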
Vishnudev Krishnadas
  • Thanks. But that counts all rows that include any values of the 1d array. I'm looking for something that requires that ALL values of the 1d array must be in the row of the 2d array. – DavidJM Jul 30 '19 at 16:18
  • @DavidJM Could you justify your second example? – Vishnudev Krishnadas Jul 30 '19 at 16:20
  • In second example, the test_elements are [4, 3]. Both 3 and 4 are included in the second row in elements. Order doesn't matter. – DavidJM Jul 30 '19 at 17:43

(EDIT: OK, now I actually had a bit more time to figure out what is going on.)


There are two issues here:

  1. the computational complexity depends on the sizes of both inputs and it is not captured well by a 1D benchmark plot
  2. the actual timings are dominated by variations in the inputs

The problem can be separated in two parts:

  1. looping through the rows
  2. performing the subset check, which is basically a nested-loop quadratic operation (in the worst-case scenario)

We know that, for sufficiently large inputs, looping through the rows is faster in NumPy and slower in pure Python.

For reference, let's consider these two approaches:

# pure Python approach
def all_in_by_row_flt(arr, elems=ELEMS):
    return sum(1 for row in arr if all(e in row for e in elems))

# NumPy approach (based on @Mstaino's answer)
def all_in_by_row_np(arr, elems=ELEMS):
    def _aaa_helper(row, e=elems):
        return np.isin(e, row)
    return np.sum(np.all(np.apply_along_axis(_aaa_helper, 1, arr), 1))

Then, considering the subset check operation: if the input is such that the check completes within few loops, pure Python looping is faster than NumPy. Conversely, if a sufficiently large number of loops is required, then NumPy can actually be faster. On top of this there is the looping through the rows, but because the subset check operation is quadratic AND the two approaches have different constant coefficients, there are situations where, despite the row looping being faster in NumPy (because the number of rows is sufficiently large), the overall operation is faster in pure Python. This was the situation I was running into in the earlier benchmarks, and it corresponds to the case where the subset check is always (or almost always) False and fails within few loops. As soon as the subset check starts requiring more loops, the pure Python approach begins to lag behind, and in the situation where the subset check is actually True for most (if not all) of the rows, the NumPy approach is actually faster.

Another key difference between the NumPy and the pure Python approach is that pure Python uses lazy evaluation while NumPy does not, and actually requires the creation of potentially large intermediate objects that slow down the computation. On top of this, NumPy iterates over the rows twice (once in np.sum() and once in np.apply_along_axis()), while the pure Python approach iterates only once.


Other approaches using set().issubset(), e.g. the one from @GZ0's answer:

def all_in_by_row_set(arr, elems=ELEMS):
    elems = set(elems)
    return sum(map(elems.issubset, arr))

have different timings than explicitly writing the nested loops when it comes to the subset checking, but they still suffer from the slower outer looping.


So, what's next?

The answer is to use Cython or Numba. The idea is to get NumPy-like (read: C) speed in all cases (not only for sufficiently large inputs), with lazy evaluation (short-circuiting) and a minimal number of passes over the rows.

An example of a Cython approach (as implemented in IPython, using the %load_ext Cython magic) is:

%%cython --cplus -c-O3 -c-march=native -a
#cython: language_level=3, boundscheck=False, wraparound=False, initializedcheck=False, cdivision=True


cdef long all_in_by_row_c(long[:, :] arr, long[:] elems) nogil:
    cdef long result = 0
    I = arr.shape[0]
    J = arr.shape[1]
    K = elems.shape[0]
    for i in range(I):
        is_subset = True
        for k in range(K):
            is_contained = False
            for j in range(J):
                if elems[k] == arr[i, j]:
                    is_contained = True
                    break
            if not is_contained:
                is_subset = False
                break
        result += 1 if is_subset else 0
    return result


def all_in_by_row_cy(long[:, :] arr, long[:] elems):
    return all_in_by_row_c(arr, elems)
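
A minimal usage sketch (assuming the cell above compiled successfully and that a C long is 64 bits on the target platform, so int64 arrays match the long[:, :] and long[:] memoryviews):

import numpy as np

arr = np.arange(15).reshape((5, 3)).astype(np.int64)   # matches long[:, :] on 64-bit Linux/macOS
elems = np.array([3, 4], dtype=np.int64)               # matches long[:]
print(all_in_by_row_cy(arr, elems))                    # 1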

While a similar Numba code reads:

import numba as nb


@nb.jit(nopython=True, nogil=True)
def all_in_by_row_jit(arr, elems=ELEMS):
    result = 0
    n_rows, n_cols = arr.shape
    for i in range(n_rows):
        is_subset = True
        for e in elems:
            is_contained = False
            for r in arr[i, :]:
                if e == r:
                    is_contained = True
                    break
            if not is_contained:
                is_subset = False
                break
        result += 1 if is_subset else 0
    return result
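
For context, the timings below come from random integer inputs. A minimal sketch of how such inputs might be generated (the exact value ranges used in the original benchmarks are not shown, so the ones here are assumptions):

import numpy as np

np.random.seed(0)
n_rows, n_cols, n_elems = 100, 1000, 1000
# the value range controls how often the subset check can short-circuit
arr = np.random.randint(0, 10 * n_cols, (n_rows, n_cols))
ELEMS = np.random.randint(0, 10 * n_cols, n_elems)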

Now, time-wise, we get the following (for a relatively small number of rows):

arr.shape=(100, 1000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy  120 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_jit 129 µs ± 131 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_flt 2.44 ms ± 13.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_set 9.98 ms ± 52.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_np  13.7 ms ± 52.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

arr.shape=(100, 2000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy  1.45 ms ± 24.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_jit 1.52 ms ± 4.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_flt 30.1 ms ± 452 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_set 19.8 ms ± 56.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_np  18 ms ± 28.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

arr.shape=(100, 3000) elems.shape=(1000,) result=37
Func: all_in_by_row_cy  10.4 ms ± 31.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_jit 10.9 ms ± 13.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_flt 226 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 30.5 ms ± 92.9 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np  21.9 ms ± 87.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

arr.shape=(100, 4000) elems.shape=(1000,) result=86
Func: all_in_by_row_cy  16.8 ms ± 32.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_jit 17.7 ms ± 42 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Func: all_in_by_row_flt 385 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 39.5 ms ± 588 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np  25.7 ms ± 128 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that the slow-down in the last block cannot be explained by the increased input size in the second dimension alone. Actually, if the short-circuit rate is increased (e.g. by changing the range of values in the random arrays), then for the last block (same input sizes) one gets:

arr.shape=(100, 4000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy   152 µs ± 1.89 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_jit  173 µs ± 4.72 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Func: all_in_by_row_flt  556 µs ± 8.56 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Func: all_in_by_row_set  39.7 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_np   31.5 ms ± 315 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Note that the set()-based method is largely independent of the short-circuit rate: its hash-based implementation gives ~O(1) membership checks, but this comes at the expense of pre-computing the hashes, and these results indicate it may not be faster than the direct nested-loop approach.

Finally, for larger row counts:

arr.shape=(100000, 1000) elems.shape=(1000,) result=0
Func: all_in_by_row_cy  141 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_jit 150 ms ± 1.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Func: all_in_by_row_flt 2.6 s ± 28.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 10.1 s ± 216 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np  13.7 s ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

arr.shape=(100000, 2000) elems.shape=(1000,) result=34
Func: all_in_by_row_cy  1.2 s ± 753 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 1.27 s ± 7.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 24.1 s ± 119 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 19.5 s ± 270 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np  18 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

arr.shape=(100000, 3000) elems.shape=(1000,) result=33859
Func: all_in_by_row_cy  9.79 s ± 11.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 10.3 s ± 5.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 3min 30s ± 1.13 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 30 s ± 57.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np  21.9 s ± 59.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

arr.shape=(100000, 4000) elems.shape=(1000,) result=86376
Func: all_in_by_row_cy  17 s ± 30.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_jit 17.9 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_flt 6min 29s ± 293 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_set 38.9 s ± 33 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Func: all_in_by_row_np  25.7 s ± 29.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Finally, note that the Cython/Numba code may still be optimized algorithmically; one possible direction is sketched below.
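
One possible direction (a plain NumPy sketch, not part of the original benchmarks) is to sort each row once and binary-search the test elements, which is roughly O(J log J + K log J) per row instead of O(J * K):

import numpy as np

def all_in_by_row_sorted(arr, elems):
    elems = np.asarray(elems)
    count = 0
    for row in arr:
        srow = np.sort(row)                    # sort the row once
        pos = np.searchsorted(srow, elems)     # binary-search every test value
        pos = np.clip(pos, 0, srow.size - 1)   # keep indices in bounds
        if np.all(srow[pos] == elems):         # all test values found in this row
            count += 1
    return count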

norok2
  • That's the implementation that I currently have. But it's quite slow when the 2d array is bigger. Was hoping for some magical way that is faster ;) – DavidJM Jul 30 '19 at 16:06
  • @DavidJM check out the updated benchmarks, but this would depend on how big is big for you. Perhaps you can give some indication of the sizes of your objects. My workstation will not fit much larger objects into memory though, but ~500 ms for crunching ~80k rows doesn't seem too bad to me :-) – norok2 Jul 30 '19 at 17:58
  • @DavidJM With numbers close to your use-case, I now see why you were disappointed with `set()`. However, check out the *new* proposed approach (which was on par with `set()` in the previous battery of tests), using a nested comprehension instead. – norok2 Jul 31 '19 at 17:51
  • @DavidJM OK, I actually messed up the analysis a bit in earlier edits, please look at this *new* one. – norok2 Aug 01 '19 at 09:22

There's probably a more efficient solution, but if you want the rows where "all" elements of test_elements are present, you can reverse the arguments of np.isin and apply it along each row, as follows:

np.apply_along_axis(lambda x: np.isin(test_elements, x), 1, elements).all(1).sum()
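
A quick check against the question's second example (a usage sketch, with elements and test_elements as defined there):

import numpy as np

elements = np.arange(15).reshape((5, 3))
test_elements = [4, 3]
count = np.apply_along_axis(lambda x: np.isin(test_elements, x), 1, elements).all(1).sum()
print(count)  # 1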
Tarifazo

A slightly more efficient (but less readable) variant of @norok2's solution is the following.

sum(map(set(test_elements).issubset, elements))
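
For reference, applied to the question's third example (a usage sketch; set.issubset accepts any iterable, so it works directly on the NumPy rows):

import numpy as np

elements = np.arange(15).reshape((5, 3))
test_elements = [3, 4, 10]
print(sum(map(set(test_elements).issubset, elements)))  # 0 -- no single row contains 3, 4 and 10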
GZ0
  • I would argue it is actually less efficient, since `sum()` will be given all the rows to process, even the `0`/ `False` ones. On the other hand, a `filter()` based method would be much better because `0`s do not reach `sum()`, but that's substantially equivalent to the *filtered* comprehension generator syntax (note that there is no intermediate container being created). – norok2 Jul 31 '19 at 15:37
  • @norok2 That is actually not the case in my tests. `sum(arr)` runs more than twice as fast as `sum(1 for e in arr if e)` given either `arr = [False] * 1000000` or `arr = [True] * 1000000`. The `filter`-based approach you proposed runs faster than `sum(arr)` on the `False` array but slower on the `True` array. – GZ0 Jul 31 '19 at 16:09
  • @norok2 Explicit generator expressions are compiled into (unnamed) functions. This introduces quite some function call overhead. – GZ0 Jul 31 '19 at 16:22
  • that is not an *apples* to *apples* comparison: (1) when you do `sum(arr)` all your looping lives in C; (2) `sum(arr)` does not include the logic associated with the `if`. Anyway, for your peace of mind, I am including your proposed approach in my benchmarks. – norok2 Jul 31 '19 at 16:28
  • that is correct, but `map()` does have the *same* overhead, something that a plain `sum()` call does not have! – norok2 Jul 31 '19 at 16:29
  • If you want to do comparison that includes all the logic of `sum(1 for e in arr if e)`, compare it with `sum(filter(bool, arr))` and `sum(map(bool, arr))`. You will see that `filter()` comes out as much faster than `map()`. – norok2 Jul 31 '19 at 16:38
  • @norok2 Your proposed comparison is problematic as well because for the general case, the logic of `sum(1 for e in arr if pred(e))` is equivalent to `sum(map(pred, e))` and `sum(1 for _ in filter(pred, arr))`. You cannot just write `sum(filter(pred, arr))` unless `arr` itself is a boolean array. I tried the comparison and the result varies on different `pred`s. The `map` approach is actually faster than `filter` for complex `pred`s such as `set.__contains__`. – GZ0 Jul 31 '19 at 17:00
  • at this point I am not sure what you are referring to, but `sum(filter(set(test_elements).issubset, elements))` is 100% valid and workable code AND it is (slightly, in the *average* case) faster than the `map()` counterpart. The equivalent comprehensions are, [for the most part](https://stackoverflow.com/questions/1247486/list-comprehension-vs-map), equivalent to the `map()` and `filter()` respectively. – norok2 Jul 31 '19 at 17:23
  • @norok2 `sum(filter(set(test_elements).issubset, elements))` is a valid statement but it does not return the expected result if the result is greater than `0`. The `filter` call returns an iterable over actual elements in `elements` rather than boolean values. Then sum will be taken over the returned rows instead of just `1`s. – GZ0 Jul 31 '19 at 17:32
  • point taken, I was absent-minded there @@, you would need an extra step to make `filter()` work. Anyway I have updated my answer with the timings including your approach. It comes out as pretty much the same as `all_in_by_row_set()` which is what I proposed in my original answer. – norok2 Jul 31 '19 at 17:52
  • @norok2 The `map` approach is around 10% faster in my tests as long as the # of matched rows is not very small (e.g., over 5% of the total rows examined). As already mentioned in the SO link you provided, `map` is often faster when the exact same function is called in the generator expression. Actually the primary reason why `map` is faster is that it avoids the repeated lookup of the called function. – GZ0 Jul 31 '19 at 18:07