How can I check if one two-dimensional NumPy array contains a specific pattern of values inside it?

Question

I have a large NumPy.array field_array and a smaller array match_array, both consisting of int values. Using the following example, how can I check if any match_array-shaped segment of field_array contains values that exactly correspond to the ones in match_array?

import numpy
raw_field = ( 24,  25,  26,  27,  28,  29,  30,  31,  23, \
              33,  34,  35,  36,  37,  38,  39,  40,  32, \
             -39, -38, -37, -36, -35, -34, -33, -32, -40, \
             -30, -29, -28, -27, -26, -25, -24, -23, -31, \
             -21, -20, -19, -18, -17, -16, -15, -14, -22, \
             -12, -11, -10,  -9,  -8,  -7,  -6,  -5, -13, \
              -3,  -2,  -1,   0,   1,   2,   3,   4,  -4, \
               6,   7,   8,   4,   5,   6,   7,  13,   5, \
              15,  16,  17,   8,   9,  10,  11,  22,  14)
field_array = numpy.array(raw_field, int).reshape(9,9)
match_array = numpy.arange(12).reshape(3,4)

These examples ought to return True since the pattern described by match_array aligns over [6:9,3:7].

It's probably worth pointing out that the x and y coordinates in `raw_field` as-presented end up being transposed in the alignment slice I've noted... — Augusta, Sep 11 '15 at 21:30

score 13 · Accepted Answer · edited May 23 '17 at 11:51

Approach #1

This approach derives from a solution to Implement Matlab's im2col 'sliding' in python that was designed to rearrange sliding blocks from a 2D array into columns. Thus, to solve our case here, those sliding blocks from field_array could be stacked as columns and compared against column vector version of match_array.

Here's a formal definition of the function for the rearrangement/stacking -

def im2col(A,BLKSZ):   

    # Parameters
    M,N = A.shape
    col_extent = N - BLKSZ[1] + 1
    row_extent = M - BLKSZ[0] + 1

    # Get Starting block indices
    start_idx = np.arange(BLKSZ[0])[:,None]*N + np.arange(BLKSZ[1])

    # Get offsetted indices across the height and width of input array
    offset_idx = np.arange(row_extent)[:,None]*N + np.arange(col_extent)

    # Get all actual indices & index into input array for final output
    return np.take (A,start_idx.ravel()[:,None] + offset_idx.ravel())

To solve our case, here's the implementation based on im2col -

# Get sliding blocks of shape same as match_array from field_array into columns
# Then, compare them with a column vector version of match array.
col_match = im2col(field_array,match_array.shape) == match_array.ravel()[:,None]

# Shape of output array that has field_array compared against a sliding match_array
out_shape = np.asarray(field_array.shape) - np.asarray(match_array.shape) + 1

# Now, see if all elements in a column are ONES and reshape to out_shape. 
# Finally, find the position of TRUE indices
R,C = np.where(col_match.all(0).reshape(out_shape))

The output for the given sample in the question would be -

In [151]: R,C
Out[151]: (array([6]), array([3]))

Approach #2

Given that opencv already has template matching function that does square of differences, you can employ that and look for zero differences, which would be your matching positions. So, if you have access to cv2 (opencv module), the implementation would look something like this -

import cv2
from cv2 import matchTemplate as cv2m

M = cv2m(field_array.astype('uint8'),match_array.astype('uint8'),cv2.TM_SQDIFF)
R,C = np.where(M==0)

giving us -

In [204]: R,C
Out[204]: (array([6]), array([3]))

Benchmarking

This section compares runtimes for all the approaches suggested to solve the question. The credit for the various methods listed in this section goes to their contributors.

Method definitions -

def seek_array(search_in, search_for, return_coords = False):
    si_x, si_y = search_in.shape
    sf_x, sf_y = search_for.shape
    for y in xrange(si_y-sf_y+1):
        for x in xrange(si_x-sf_x+1):
            if numpy.array_equal(search_for, search_in[x:x+sf_x, y:y+sf_y]):
                return (x,y) if return_coords else True
    return None if return_coords else False

def skimage_based(field_array,match_array):
    windows = view_as_windows(field_array, match_array.shape)
    return (windows == match_array).all(axis=(2,3)).nonzero()

def im2col_based(field_array,match_array):   
    col_match = im2col(field_array,match_array.shape)==match_array.ravel()[:,None]
    out_shape = np.asarray(field_array.shape) - np.asarray(match_array.shape) + 1  
    return np.where(col_match.all(0).reshape(out_shape))

def cv2_based(field_array,match_array):
    M = cv2m(field_array.astype('uint8'),match_array.astype('uint8'),cv2.TM_SQDIFF)
    return np.where(M==0)

Runtime tests -

Case # 1 (Sample data from question):

In [11]: field_array
Out[11]: 
array([[ 24,  25,  26,  27,  28,  29,  30,  31,  23],
       [ 33,  34,  35,  36,  37,  38,  39,  40,  32],
       [-39, -38, -37, -36, -35, -34, -33, -32, -40],
       [-30, -29, -28, -27, -26, -25, -24, -23, -31],
       [-21, -20, -19, -18, -17, -16, -15, -14, -22],
       [-12, -11, -10,  -9,  -8,  -7,  -6,  -5, -13],
       [ -3,  -2,  -1,   0,   1,   2,   3,   4,  -4],
       [  6,   7,   8,   4,   5,   6,   7,  13,   5],
       [ 15,  16,  17,   8,   9,  10,  11,  22,  14]])

In [12]: match_array
Out[12]: 
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [13]: %timeit seek_array(field_array, match_array, return_coords = False)
1000 loops, best of 3: 465 µs per loop

In [14]: %timeit skimage_based(field_array,match_array)
10000 loops, best of 3: 97.9 µs per loop

In [15]: %timeit im2col_based(field_array,match_array)
10000 loops, best of 3: 74.3 µs per loop

In [16]: %timeit cv2_based(field_array,match_array)
10000 loops, best of 3: 30 µs per loop

Case #2 (Bigger random data):

In [17]: field_array = np.random.randint(0,4,(256,256))

In [18]: match_array = field_array[100:116,100:116].copy()

In [19]: %timeit seek_array(field_array, match_array, return_coords = False)
1 loops, best of 3: 400 ms per loop

In [20]: %timeit skimage_based(field_array,match_array)
10 loops, best of 3: 54.3 ms per loop

In [21]: %timeit im2col_based(field_array,match_array)
10 loops, best of 3: 125 ms per loop

In [22]: %timeit cv2_based(field_array,match_array)
100 loops, best of 3: 4.08 ms per loop

Out of curiousity, how do these approaches compare with ajcr's `skimage` solution or the block-at-a-time method I posted? (If you don't know or want to test it, that's perfectly understandable. XD) The `cv2` solution looks really interesting, but I don't have the module and don't know anything about it. I'll put it on my List Of Stuff To Check Out, though! — Augusta, Sep 15 '15 at 02:41
@Augusta No worries, added the benchmarking codes and results. Hope this helps! — Divakar, Sep 15 '15 at 08:19
Wow, that's pretty amazing! I was expecting my vanilla check to be slow, but I didn't think there'd be alternatives *that* much faster! Thanks for your results! Looks like `cv2` is going to queue cut a little on that list after all. ;) — Augusta, Sep 15 '15 at 20:13
@Augusta Yup, definitely look into `cv2` and get it installed. I think it's easy to install, not that I remember the steps :) Well good luck and keep learning through SO! — Divakar, Sep 16 '15 at 17:36

Alex Riley · Answer 2 · 2015-09-11T21:31:15.873

There's no such search function built in to NumPy, but it is certainly possible to do in NumPy

As long as your arrays are not too massive*, you could use a rolling window approach:

from skimage.util import view_as_windows

windows = view_as_windows(field_array, match_array.shape)

The function view_as_windows is written purely in NumPy so if you don't have skimage you can always copy the code from here.

Then to see if the sub-array appears in the larger array, you can write:

>>> (windows == match_array).all(axis=(2,3)).any()
True

To find the indices of where the top-left corner of the sub-array matches, you can write:

>>> (windows == match_array).all(axis=(2,3)).nonzero()
(array([6]), array([3]))

This approach should also work for arrays of higher dimensions.

*although the array windows takes up no additional memory (only the strides and shape are changed to create a new view of the data), writing windows == match_array creates a boolean array of size (7, 6, 3, 4) which is 504 bytes of memory. If you're working with very large arrays, this approach might not be feasible.

Augusta · Answer 3 · 2015-09-11T20:29:19.510

One solution is to search the entire search_in array block-at-a-time (a 'block' being a search_for-shaped slice) until either a matching segment is found or the search_for array is exhausted. I can use it to get coordinates for the matching block, or just a bool result by sending True or False for the return_coords optional argument...

def seek_array(search_in, search_for, return_coords = False):
    """Searches for a contiguous instance of a 2d array `search_for` within a larger `search_in` 2d array.
If the optional argument return_coords is True, the xy coordinates of the zeroeth value of the first matching segment of search_in will be returned, or None if there is no matching segment.
If return_coords is False, a boolean will be returned.
 * Both arrays must be sent as two-dimensional!"""
    si_x, si_y = search_in.shape
    sf_x, sf_y = search_for.shape

    for y in xrange(si_y-sf_y+1):
        for x in xrange(si_x-sf_x+1):
            if numpy.array_equal(search_for, search_in[x:x+sf_x, y:y+sf_y]):
                return (x,y) if return_coords else True  # don't forget that coordinates are transposed when viewing NumPy arrays!
    return None if return_coords else False

I wonder if NumPy doesn't already have a function that can do the same thing, though...

It is an answer to the question, because the code functions properly does what I need it to, although perhaps there are more efficient methods. Generally, policy is to resist answering in the question itself, yes? I mean, we have the option of answering our questions straight away for exactly this reason, I thought... — Augusta, Sep 11 '15 at 20:26
I should probably have chosen a different way to begin the answer post, though. I'll edit it now. — Augusta, Sep 11 '15 at 20:28
I just feel like *how-to* questions are more appropriate to SO's format than *better-way* questions, for reason that subjective words like "better" tend to invite trouble in the form of differing opinions. "How can I--?" questions ask only for methods; to that end, I have provided one solution, and I hope others will provide more and better ones. That's all. — Augusta, Sep 11 '15 at 20:47

score 1 · Answer 4 · answered Feb 01 '16 at 22:09

To add to the answers already posted, I'd like to add one that takes into account errors due to floating point precision in case that matrices come from, let's say, image processing for instance, where numbers are subject to floating point operations.

You can recurse the indexes of the larger matrix, searching for the smaller matrix. Then you can extract a submatrix of the larger matrix matching the size of the smaller matrix.

You have a match if the contents of both, the submatrix of 'large' and the 'small' matrix match.

The following example shows how to return the first indexes of the location in the large matrix found to match. It would be trivial to extend this function to return an array of locations found to match if that's the intent.

import numpy as np

def find_submatrix(a, b):
    """ Searches the first instance at which 'b' is a submatrix of 'a', iterates
        rows first. Returns the indexes of a at which 'b' was found, or None if
        'b' is not contained within 'a'"""
    a_rows=a.shape[0]
    a_cols=a.shape[1]

    b_rows=b.shape[0]
    b_cols=b.shape[1]

    row_diff = a_rows - b_rows
    col_diff = a_cols - b_cols

    for idx_row in np.arange(row_diff):
        for idx_col in np.arange(col_diff):
            row_indexes = [idx + idx_row for idx in np.arange(b_rows)]
            col_indexes = [idx + idx_col for idx in np.arange(b_cols)]

            submatrix_indexes = np.ix_(row_indexes, col_indexes)
            a_submatrix = a[submatrix_indexes]

            are_equal = np.allclose(a_submatrix, b)  # allclose is used for floating point numbers, if they
                                                     # are close while comparing, they are considered equal.
                                                     # Useful if your matrices come from operations that produce
                                                     # floating point numbers.
                                                     # You might want to fine tune the parameters to allclose()
            if (are_equal):
                return[idx_col, idx_row]

    return None

Using the function above you can run the following example:

large_mtx = np.array([[1,  2, 3, 7, 4, 2, 6],
                      [4,  5, 6, 2, 1, 3, 11],
                      [10, 4, 2, 1, 3, 7, 6],
                      [4,  2, 1, 3, 7, 6, -3],
                      [5,  6, 2, 1, 3, 11, -1],
                      [0,  0, -1, 5, 4, -1, 2],
                      [10, 4, 2, 1, 3, 7, 6],
                      [10, 4, 2, 1, 3, 7, 6] 
                     ])

# Example 1: An intersection at column 2 and row 1 of large_mtx
small_mtx_1 = np.array([[4, 2], [2,1]])
intersect = find_submatrix(large_mtx, small_mtx_1)
print "Example 1, intersection (col,row): " + str(intersect)

# Example 2: No intersection
small_mtx_2 = np.array([[-14, 2], [2,1]])
intersect = find_submatrix(large_mtx, small_mtx_2)
print "Example 2, intersection (col,row): " + str(intersect)

Which would print:

Example 1, intersection: [1, 2]
Example 2, intersection: None

Thanks or this solution! But is approximation useful for comparing arrays of relatively small integers? (I'm not saying that your answer isn't useful at all, but in the context of numbers less than, say, two hundred.) Also, how does it do in terms of processing efficiency? (This is not me asking you to retest all of the solutions here, just maybe comparing it to the similar[-looking] but really-quite-slow `seek_array` solution..?) For situations where approximation is useful, this would be the obvious pick, but how does it do (speed-wise) for exact comparisons? :D — Augusta, Feb 03 '16 at 21:06
@Augusta Fair enough and thorough question :). I think that if you are using integers, I would get rid of the allclose() and substitute it by np.all(np.equal()), this will reduce the number of comparisons since the code will not be dealing with tolerances. I don't know how it will compare with other solutions proposed here, but this solution can definitely be improved in terms of efficiency performance, since this algorithm is scan-based, trying to match the whole matrix. We can instead pre-index potential comparison points and only at those points compare the full matrix. — Jorge Torres, Feb 03 '16 at 21:48
I just ran a profiling by removing the tolerance-based and changing it for integer compasirron. These are the profiling results with the data as provided in your question: `Ran 2000 search cycles in 2.9543030262 [sec] = 0.0014771515131 [sec/cycle]` — Jorge Torres, Feb 03 '16 at 22:00
So definitely the cv2_based approach suggested in Divakar's answer would be preferable — Jorge Torres, Feb 03 '16 at 22:09

score 0 · Answer 5 · answered Nov 13 '18 at 10:31

Here's a solution using the as_strided() function from stride_tricks module

import numpy as np
from numpy.lib.stride_tricks import as_strided

# field_array (I modified it to have two matching arrays)
A = np.array([[ 24,  25,  26,  27,  28,  29,  30,  31,  23],
              [ 33,   0,   1,   2,   3,  38,  39,  40,  32],
              [-39,   4,   5,   6,   7, -34, -33, -32, -40],
              [-30,   8,   9,  10,  11, -25, -24, -23, -31],
              [-21, -20, -19, -18, -17, -16, -15, -14, -22],
              [-12, -11, -10,  -9,  -8,  -7,  -6,  -5, -13],
              [ -3,  -2,  -1,   0,   1,   2,   3,   4,  -4],
              [  6,   7,   8,   4,   5,   6,   7,  13,   5],
              [ 15,  16,  17,   8,   9,  10,  11,  22,  14]])

# match_array
B = np.arange(12).reshape(3,4)


# Window view of A
A_w = as_strided(A, shape=(A.shape[0] - B.shape[0] + 1,
                           A.shape[1] - B.shape[1] + 1,
                           B.shape[0], B.shape[1]),
                    strides=2*A.strides).reshape(-1, B.shape[0], B.shape[1])

match = (A_w == B).all(axis=(1,2))

We can also find the indices of the first element of each matching block in A

where = np.where(match)[0]
ind_flat = where + (B.shape[1] - 1)*(np.floor(where/(A.shape[1] - B.shape[1] + 1)).astype(int))
ind = [tuple(row) for row in np.array(np.unravel_index(ind_flat, A.shape)).T]

Result

print(match.any())
True

print(ind)
[(1, 1), (6, 3)]

How can I check if one two-dimensional NumPy array contains a specific pattern of values inside it?

5 Answers5

Benchmarking

Linked

Related