Two Dimensional Pattern Matching with Wildcards in Numpy Arrays

Question

First of, I'm aware of this, this and this post about two-dimensional pattern matching, these posts however all don't include wildcard matching. Furthermore, I know there are several papers such as this one that have solved the issue I'm facing. However, I'm only just getting familiar with pattern matching in two dimensional arrays and trying to implement the algorithms in the papers is non-trivial for me at the moment.

Thus, the following problem is what I'm facing.

Given a two dimensional array:

[[1 3 4 5 3 3]
 [4 1 4 5 5 2]
 [5 4 3 4 4 2] # pattern
 [4 5 3 4 1 3] # pattern
 [3 3 3 4 4 4] # pattern
 [3 4 3 4 2 5] # pattern
 [4 5 3 4 1 2] # pattern
 [5 1 1 2 4 2]
 [2 1 3 2 1 5]
 [4 4 1 3 3 1]
 [1 4 3 4 4 1]
 [5 2 4 4 4 1]]

And the following example pattern (where a ? would indicate the wildcard match):

[[? ? 3 4 ? ?]
 [? ? 3 4 ? ?]
 [3 3 3 4 4 4]
 [? ? 3 4 ? ?]
 [? ? 3 4 ? ?]]

How would a function be written that takes in the two dimensional array and the pattern, returning True if the pattern is present in the array or False if it is not?

If possible a generalized solution to this problem would be highly appreciated, as there are numerous distinct patterns that I'm trying to match. If required I'm more than willing to provide additional examples.

Does the pattern always occupy all columns? So you could, in theory, just test the pattern against `arr[0:5,:]`, then against `arr[1:6,:]` and so on? — hpaulj, Aug 29 '18 at 22:46
@hpaulj There are various patterns that don't occupy all columns no, this is just one example of a pattern that happens to do so. I could add additional examples to my original post if that would help explain more clearly the issue I have? — Menno Van Dijk, Aug 29 '18 at 22:52
Keep in mind that code that works nicely with compiled languages might not makes optimal use of `numpy` array processing. Image convolution code in `scipy` might be closest you'll get with commonly used packages, https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.ndimage.filters.convolve.html — hpaulj, Aug 30 '18 at 00:02
What is the input format of the "pattern"? Also @DanielMesejo 's question about zero is important. — Daniel F, Aug 30 '18 at 06:18
@DanielMesejo The range of values goes from 1 to 7, it does not include 0 values. — Menno Van Dijk, Aug 30 '18 at 08:23
@DanielF The input format of the pattern, in this case, will always be a (12,6) two-dimensional numpy array filled with integers between the range 1 to 7. The pattern will always be a (M,N) two-dimensional numpy array filled with integers between the range 1 to 7 and wildcards. The pattern can thus also have different dimensions on each axis. — Menno Van Dijk, Aug 30 '18 at 09:01
@DanielF Wildcards can take on any value (and will hence match any value present at the index the wildcard is present). I've used "?" to denote these wildcards, sometimes these get denote them as "*", though. — Menno Van Dijk, Aug 30 '18 at 09:22
In the `pattern`. It's an array right? What are you using to signify a wildcard in the *actual program*? You don't just use a symbol do you? Because then you're going to make us needlessly parse string arrays. — Daniel F, Aug 30 '18 at 09:26
@DanielF Both the pattern and the original are numpy arrays in my case. If I knew the answer to your question, however, I could probably figure out how to solve the above problem myself. I don't know whether using symbols is actually the way to go to match any value within an array. — Menno Van Dijk, Aug 30 '18 at 09:31
OK, you're thinking of it like a string. That's . . . not really how it works in `numpy`, and that's why folks are confused. There's no special wildcard symbol like you'd get in `regex`, but you can do the same thing mathematically. — Daniel F, Aug 30 '18 at 09:56
@DanielF Well yeah, if it was an array of strings then the wildcard would work like the above, I could theoretically convert all integers to strings if that would mitigate that issue? And how would you do the same thing mathematically? That's essentially what I'm asking, as there are several posts about exact pattern matching out there already. — Menno Van Dijk, Aug 30 '18 at 10:04

score 1 · Answer 1 · answered Aug 30 '18 at 10:25

Since your searching space if very small we don't have to worry about memory errors by just unravelling a windowed view.

First you need a mask of where there are values in your pattern

mask

array([[False, False,  True,  True, False, False],
       [False, False,  True,  True, False, False],
       [ True,  True,  True,  True,  True,  True],
       [ True,  True,  True,  True,  True,  True],
       [False, False,  True,  True, False, False],
       [False, False,  True,  True, False, False]], dtype=bool)

Then you need an array of what the values are in those positions:

val = np.array([ 3.,  4.,  3.,  4.,  3.,  3.,  3.,  4.,  4.,  4.,
                 3.,  3.,  3.,  4.,  4.,  4.,  3.,  4.,  3.,  4.])

Then you need a sliding window across your input. Easiest implementation to use is skimage.util.view_as_windows but you can use my pure numpy implementatation here

windows = skimage.util.view_as_windows(input, pattern.shape)
# or
windows = window_nd(input, pattern.shape)

Now, normally doing windows[mask] here would be dangerous - it can create a huge array if you're convolving over many windows. But sice the most windows we could ever have here is 12*6 = 72, we don't have to worry about that.

loc = np.where(np.all(window[mask] == val, axis = -1))

Now loc is the coordinates of the upper left corner of the matching windows.

Or it should be. Maybe provide a test case that can be copy/pasted into an interpreter?

score 1 · Accepted Answer · answered Aug 30 '18 at 10:52

This function takes an input_array, a pattern and a function that lets you identify the wildcard. Here I have used np.nan as wildcard but it could be anything, giving that you can make your own wildcard_function.

It works for arrays of any dimension (1 or more). I have tested it for your example and it seems ok.

from itertools import product
import numpy as np


def match_pattern(input_array, pattern, wildcard_function=np.isnan):

    pattern_shape = pattern.shape
    input_shape = input_array.shape

    is_wildcard = wildcard_function(pattern) # This gets a boolean N-dim array

    if len(pattern_shape) != len(input_shape):
        raise ValueError("Input array and pattern must have the same dimension")

    shape_difference = [i_s - p_s for i_s, p_s in zip(input_shape, pattern_shape)]

    if any((diff < -1 for diff in shape_difference)):
        raise ValueError("Input array cannot be smaller than pattern in any dimension")

    dimension_iterators = [range(0, s_diff + 1) for s_diff in shape_difference]

    # This loop will iterate over every possible "window" given the shape of the pattern
    for start_indexes in product(*dimension_iterators):
        range_indexes = [slice(start_i, start_i + p_s) for start_i, p_s in zip(start_indexes, pattern_shape)]
        input_match_candidate = input_array[range_indexes]

        # This checks that for the current "window" - the candidate - every element is equal 
        #  to the pattern OR the element in the pattern is a wildcard
        if np.all(
            np.logical_or(
                is_wildcard, (input_match_candidate == pattern)
            )
        ):
            return True

    return False

I've tested this for various different patterns and it seems to work exactly as expected. Appreciate the answer! — Menno Van Dijk, Aug 30 '18 at 11:05

Two Dimensional Pattern Matching with Wildcards in Numpy Arrays

2 Answers2