
I'm trying to detect patterns from open-high-low-close (OHLC) data, so here is what I did:

  1. Find local minima and maxima on the dataset
  2. Normalize my data by converting the array of local minima and maxima to an array of numbers, where every number is the variation from the previous point.
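
For reference, here is a minimal sketch of those two steps (assuming scipy for the extrema; the close prices are made-up numbers, and whether "variation" means absolute or percent change doesn't matter for the question):

import numpy as np
from scipy.signal import argrelextrema

closes = np.array([10.0, 10.8, 10.2, 11.5, 11.1, 12.3, 11.9, 13.0])  # made-up close prices

idx_max = argrelextrema(closes, np.greater)[0]  # indices of local maxima
idx_min = argrelextrema(closes, np.less)[0]     # indices of local minima
idx = np.sort(np.concatenate([idx_min, idx_max]))

variations = np.diff(closes[idx])  # change from the previous extremum
print(variations)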

So far everything works, but I'm stuck on the next part. I defined an array of data, a pattern, that has a certain shape when plotted on a chart. I'm now trying to find, in other datasets, shapes that are similar to the pattern I specified.

Here is the pattern specified by me:

Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172]

And here is a sample dataset:

SampleTarget = [-2.2538552787663173, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 2.6012891344383067]

I'm looking for a way to detect when, at some point in SampleTarget, a series of values appears that is similar to Pattern.

In this case, for example, I need to detect, somehow, that there is a part of SampleTarget where the values are similar to Pattern, since Pattern was extracted from this same dataset.

What I tried:

I've been advised to use numpy.correlate, python-dtw (dynamic time warping), or stumpy, but the problem I ran into with those is the lack of practical examples on this particular matter.


4 Answers


Here is a trick to do it:

import numpy as np
pat = np.array(Pattern)
data = np.array(SampleTarget)
n = len(data)
m = len(pat)
k = data.strides[0] # typically 8 for float64

# data_2d is a view of the original data, where each row is a sliding window:
# data_2d[:-6, 6] == data_2d[1:-5, 5] == ... == data_2d[6:, 0]
data_2d = np.lib.stride_tricks.as_strided(data, shape=(n-m+1, m), strides=(k, k))

# So you can check for matches on data_2d[i, :] for all i
print(np.all(np.isclose(data_2d, pat), axis=1))

Output:

array([False, False, False, False, False, False, False,  True, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False])

You can use np.where or np.argwhere to get the index of the match(es). You can tune the atol and rtol parameters of np.isclose to set the threshold for an approximate match.
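
For example, a minimal sketch of pulling out the matching window indices (the atol=0.5 here is an arbitrary tolerance, not a recommendation):

matches = np.argwhere(np.all(np.isclose(data_2d, pat, atol=0.5), axis=1)).ravel()
print(matches)  # -> [7] for the sample data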

Clarification: if you do the as_strided trick on data=np.arange(30), then data_2d will be:

array([[ 0,  1,  2,  3,  4,  5,  6],
       [ 1,  2,  3,  4,  5,  6,  7],
       [ 2,  3,  4,  5,  6,  7,  8],
       ...
       [21, 22, 23, 24, 25, 26, 27],
       [22, 23, 24, 25, 26, 27, 28],
       [23, 24, 25, 26, 27, 28, 29]])

EDIT: This is an efficient way to create a view of the same data with a sliding windows, without requiring extra memory. A numpy array lookup a[i, j] finds the memory address as start_address + a.strides[0]*i + a.strides[1]*j; by setting strides to (8, 8), where 8 is the size of a float value, you achieve the sliding-window effect. Because different array elements refer to the same memory, it's best to treat an array constructed this way as read-only.
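
If your NumPy is 1.20 or newer (an assumption about your environment), np.lib.stride_tricks.sliding_window_view builds the same view without manual stride arithmetic, and returns it read-only:

data_2d = np.lib.stride_tricks.sliding_window_view(data, m)  # equivalent to the as_strided call above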

EDIT: if you want to have a "score" metric for the quality of the match, you can for example do this:

>>> np.linalg.norm(data_2d - pat, axis=1) 

array([17.5, 17.4, 13.3, 20.5, 12.9, 14.9, 19.7,  0. , 17.4, 13.8, 16.9,
       13.7, 19. , 10.3, 18.3, 15.2, 10.9, 22.3, 13. , 21.8, 15.2, 24.5,
       14.9, 20.7])
# (numbers rounded to reduce clutter)

Closer to zero means a better match. Here, norm takes the length of each difference vector d = data_2d[i] - pat, i.e., sqrt(d[0]**2 + ... + d[m-1]**2).
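
That is, the norm line above is equivalent to this explicit computation:

d = data_2d - pat                   # pat is broadcast across all windows
print(np.sqrt((d**2).sum(axis=1)))  # same numbers as np.linalg.norm(..., axis=1)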

EDIT: If you are interested in patterns that have the same shape, but are scaled to a larger or smaller value, you can do this:

# New dataset with two occurrences of the pattern: one scaled by a factor 1.1,
# one scaled 0.5 with a bit of noise added
data_mod = data*1.1
np.random.seed(1)
data_mod[16:16+m] = pat*0.5 + np.random.uniform(-0.5, 0.5, size=m)
data_2d_mod = np.lib.stride_tricks.as_strided(
    data_mod, shape=(n-m+1, m), strides=(k, k))

# pat_inv: pseudoinverse of pat vector
pat_inv = 1/(pat @ pat) * pat 

# cofs: least-squares fit coefficients, shape (n-m+1,)
cofs = data_2d_mod @ pat_inv

# sum of squared residuals, shape (n-m+1,) - zero means perfect fit
ssqr = ((data_2d_mod - cofs.reshape(-1, 1) * pat)**2).sum(axis=1)

print(f'cofs:\n{np.around(cofs, 2)}')
print(f'ssqr:\n{np.around(ssqr, 1)}')

Result:

cofs:
[-0.38 -0.14  0.4  -0.54  0.59  0.36 -0.48  1.1  -0.33  0.12 -0.06  0.18
 -0.21  0.23  0.22 -0.33  0.52 -0.2   0.22 -0.35  0.6  -0.91  0.92  0.01]
ssqr:
[ 81.6 161.8 147.4 155.1 167.3 196.1 138.6   0.   97.8 103.5  85.9  59.3
  57.1  54.9  58.3  29.2   0.7 198.7 217.4 201.9 266.3 235.1 242.8 361.9]

You see that cofs[7] == 1.1, meaning that the pattern had to be scaled by a factor 1.1 on the corresponding data window for a best fit. The fit was perfect, which you can see from ssqr[7] == 0. It also finds the other one, with cofs[16] == 0.52 (close to the expected 0.5 value) and ssqr[16] == 0.7.

Other example: cofs[21] == -0.91 and ssqr[21] == 235.1. This means that data_mod[21:28] somewhat resembles the pattern, but inverted (positive and negative swapped). It depends on what you want to do with the data; most likely you'd like to look at cofs values in the range 0.5 to 2: your search pattern is allowed to occur in the data a factor 2 larger or smaller. This should be combined with sufficiently small ssqr values.
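
A minimal sketch of that combined filter (the ssqr cutoff of 5 is an arbitrary choice):

mask = (cofs >= 0.5) & (cofs <= 2) & (ssqr < 5)  # plausible scale and small residual
print(np.argwhere(mask).ravel())  # -> [ 7 16] for this example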

Here you see the three potential matches in a graph:

[figure: data with pattern matches]

If you use ssqr as a score metric, be aware that a series of zeros in the input will result in cofs=0 and ssqr=0.

Consider using np.sqrt(ssqr/m)/np.abs(cofs) as a metric instead, for two reasons: (1) it will match according to relative error and result in NaN values in the case of zero input; (2) it is more intuitive: if the value is 0.5, it means that the data points deviate by about 0.5 from the pattern values. Here are the values for this metric, using the same example data:

[ 9.1  35.3  11.6  8.8   8.3  14.8   9.4   0.  11.4  33.3 55.9  16.4
 13.9  12.1  12.9  6.2   0.6  27.2  25.4 15.2  10.4  6.4   6.4 482.5]

For the match at data_mod[21:28], the difference metric is 6.4, which corresponds roughly to the differences as seen in the plot.
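
For reference, the metric is computed like this:

score = np.sqrt(ssqr / m) / np.abs(cofs)  # relative deviation per point
print(np.around(score, 1))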

Han-Kwang Nienhuys
  • Thank you a lot for your answer. So in this case it works when the pattern matches exactly; but what happens when a pattern is found that doesn't match exactly? Is there any way to get some sort of "score", where the higher/lower the score, the better the pattern matches? – Jack022 Jul 12 '20 at 20:58
  • I updated the answer two days ago with a score metric; now you have added a bounty to the question; what is missing from my answer? – Han-Kwang Nienhuys Jul 14 '20 at 22:20
  • I'm sorry! I haven't been able to check SO and I didn't see it. I just opened the bounty to see if there are other approaches too, since I've been advised to look into machine learning algorithms. If you don't mind, can you explain briefly what those three numpy functions do? – Jack022 Jul 14 '20 at 23:07
  • Answer updated. Not sure which "three numpy functions" you're referring to, but I added explanations for `as_strided` and `norm`. Your example data `Pattern` and `SampleTarget` have only one (and exact) match. If you hid other "similar patterns" in `SampleTarget`, I'd like to know. Machine learning may be useful in the greater context of what you're doing, but it is not going to help for the specific question that you have asked here. – Han-Kwang Nienhuys Jul 15 '20 at 10:22
  • Thank you a lot for your edits. I tried the last snippet of code but I got the following error: could not broadcast input array from shape (4) into shape (0). I tried this with Pattern = [10, -3, 1, 0.5] – Jack022 Jul 15 '20 at 19:32
  • I made a pastebin of the snippet I tried, I might be forgetting something: https://pastebin.com/HM1gFnVq – Jack022 Jul 15 '20 at 22:04
  • I am using Numpy 1.18.5; I just tried again and it works now. I just have some final questions: what is the difference between using cofs and ssqr as my score metric? Why are some values negative in cofs? And if a good match on my score metric is at index 3, for example, does that mean the similar pattern will be at index 3 in SampleTarget? – Jack022 Jul 16 '20 at 09:28
  • That's great, but I still don't fully understand the last part of the code: do I have to set the factor by which to scale the pattern manually, or was that just for illustration purposes? I also don't get the `numpy.lib.stride_tricks.as_strided` part; what is it doing? And wouldn't there be a better alternative, according to the documentation: https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html – BenjaminDiLorenzo Oct 07 '21 at 06:37
  • @Han-KwangNienhuys what if the data has an offset (imagine two sinusoids, one with an offset value like 3) and is shifted a bit? Example: https://ibb.co/VJc4sDR What can we do to improve the output of this algorithm to detect this pattern in a given window? By the way, the magnitudes are also distinctive in my case compared to the rest of the data, so there is a spike like in the link above. – asevindik Jan 24 '22 at 22:22

To find a known pattern, Q, from an independent time series, T, with the STUMPY Python package you'll need to do something like this:

from stumpy.core import mass
import numpy as np

Pattern = np.array([7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172])

SampleTarget = np.array([-2.2538552787663173, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 2.6012891344383067])

distance_profile = mass(Pattern, SampleTarget)

# Output of `distance_profile`
array([4.55219811, 4.21544139, 3.29336127, 4.72614564, 2.94202855,
       3.33790488, 4.62672866, 0.        , 4.51937582, 3.47144433,
       4.17966567, 3.26871969, 4.72146046, 2.53070957, 4.46398626,
       3.64503919, 2.64282983, 4.81577841, 2.69799924, 4.64286098,
       2.67446216, 4.52739326, 2.54663088, 3.79885921])

Essentially, the mass function computes a distance_profile by taking your Pattern and sliding a window (that is the same length as your Pattern) along your SampleTarget and calculating the z-normalized Euclidean distance. Each "window" is referred to as a subsequence, and each element of the `distance_profile` corresponds to the distance between one subsequence and your `Pattern`.

So, for instance, the distance between your Pattern and the first subsequence, SampleTarget[0:0+len(Pattern)], is distance_profile[0] = 4.55219811.

Similarly, the distance between your Pattern and the second subsequence, SampleTarget[1:1+len(Pattern)], is distance_profile[1] = 4.21544139.

And, generally, the distance between your Pattern and the ith subsequence, SampleTarget[i:i+len(Pattern)], is distance_profile[i].
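
As a sanity check, you can recompute a single entry of the distance_profile by hand. This is just a sketch (znorm is a small helper defined here, not part of STUMPY):

def znorm(x):
    # z-normalize: subtract the mean, divide by the standard deviation
    return (x - x.mean()) / x.std()

i = 7
window = SampleTarget[i:i + len(Pattern)]
print(np.linalg.norm(znorm(window) - znorm(Pattern)))  # ~0.0, matching distance_profile[7]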

Now, to find the parts of SampleTarget that are "closest" to Pattern, you can look for the smallest values in your distance_profile and then use the corresponding index from your distance_profile to cross reference the index from your SampleTarget.

More concretely, using our example from above, the smallest value found in distance_profile is 0 (a perfect match), and this is found at index i = 7. So you should find that SampleTarget[7:7+len(Pattern)] is identical to Pattern. Note that STUMPY (and mass) doesn't care whether or not an identical match exists. What you'll likely want to do is decide on a reasonable distance threshold/cutoff and examine all "matches" that fall below this distance threshold. Anecdotally/statistically, I recommend choosing a threshold that is below np.mean(distance_profile) - 2 * np.std(distance_profile) as a reasonably informed starting point.
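
A minimal sketch of that thresholding (with this sample data it flags only the exact match at index 7):

threshold = np.mean(distance_profile) - 2 * np.std(distance_profile)
matches = np.argwhere(distance_profile < threshold).ravel()
print(matches)  # -> [7]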

One final note: the mass function computes the sliding window distances in O(n log n) (the log is base 2), while a naive sliding window computes the distance profile in O(nm) (where m is the length of your pattern). So, for m > 20, mass will always be faster, but the performance difference is essentially imperceptible for shorter patterns. And in case anybody wants to debate this, please keep in mind that mass is JIT-compiled, so the first time the function is called it will be "slow" due to the fact that the function needs to be compiled, but it should be very fast thereafter.

slaw
  • First of all, I'm sorry for the very late answer! Second, thank you a lot for taking the time to explain; I really hope this answer will be useful to anyone else having the same problem. Third, this question was asked when I already knew about STUMPY but before I knew about MASS, so you should be pleased to know that I'm using MASS for this task; actually, I also opened some issues on your GitHub repository :) – Jack022 Aug 18 '20 at 10:22
  • Hi, @Jack022! I hope `mass` is working out well for you. – slaw Aug 18 '20 at 13:32

Here's a rather improvised solution that assumes you are looking for an exact match. It just brute-forces match checks by iterating over the entire list: if a value matches the first number of the pattern, it checks the positions that follow, and so on. It also assumes Pattern[0] is not repeated within the Pattern list, although that could easily be coded around with a bit more bedazzling.

for i in range(len(SampleTarget) - len(Pattern) + 1):
    # Iterate over the list and check if the number matches the first
    # one in our pattern
    if SampleTarget[i] == Pattern[0]:
        # Hey, this index might be the start of our pattern;
        # let's check whether the following items match too
        startIndex = i
        for x in range(len(Pattern)):
            curCheck = startIndex + x # Get current place to check against

            if SampleTarget[curCheck] != Pattern[x]:
                # Disregard this window, this isn't it
                break
        else:
            # We only get here if the inner loop never hit a break,
            # so every value matched; print the indices of the match
            print(f"Found a pattern match in the sample!\nStart Index: {startIndex}\nEnd Index: {curCheck}")

Here's my take on a version that matches non-exact values within a given tolerance. Feel free to change the tolerance as wanted; it is currently 0.005, and you can read about it here

import math

for i in range(len(SampleTarget) - len(Pattern) + 1):
    if math.isclose(SampleTarget[i], Pattern[0], abs_tol=0.005):
        startIndex = i
        for x in range(len(Pattern)):
            curCheck = startIndex + x

            if not math.isclose(SampleTarget[curCheck], Pattern[x], abs_tol=0.005):
                break
        else:
            # Only report a match if the whole pattern passed the check
            print(f"Found a pattern match in the sample!\nStart Index: {startIndex}\nEnd Index: {curCheck}")

Both will output the same thing; the second just checks for closeness within the tolerance rather than exact equality.

Hope this helps! I know you mentioned all those libraries, but I pulled out for loops instead hahaha

Ethan M-H
  • Thank you a lot! This is actually a very interesting solution. The problem is that I won't be looking for an exactly matching pattern; when I check my pattern against different datasets, I will need to look for the most similar ones – Jack022 Jul 12 '20 at 14:04
  • I added an edit for checking against similar values rather than exact ones @Jack022 – Ethan M-H Jul 12 '20 at 14:08
  • This could work! How would it handle it if, for example, two patterns are found in two different parts? – Jack022 Jul 12 '20 at 14:47
  • This should find `all` parts in your SampleTarget that match Pattern. It will then show the indexes for said pattern matches – Ethan M-H Jul 13 '20 at 01:11
  • Please check my solution, which is a fuzzy matching method for the approximate polygon matching problem, with a scoring method – Akshay Sehgal Jul 15 '20 at 04:52

The problem you are trying to solve is an approximate sub-sequence matching problem (also called fuzzy polygon matching).

This problem can be solved with Levenshtein distance. Let's assume:

Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172]
SampleTarget = [-2.2538552787663173, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 2.6012891344383067]
import numpy as np
import matplotlib.pyplot as plt

x0 = np.arange(len(SampleTarget))
x1 = np.arange(len(Pattern))
plt.plot(x0, SampleTarget)
plt.plot(x1, Pattern)

You are trying to match the Pattern to the SampleTarget by 'rolling' it over the axis. Basically, you need a score that tells you how 'distant' the Pattern's shape is from the window of SampleTarget that it covers. This can be done with edit distance, also known as Levenshtein distance, which intuitively is just:

What is the number of edits I need to change one specific sequence into another?

[figure: illustration of edit distance]

#!pip install Distance
import distance

score = []
for i in range(len(SampleTarget)):
    SampleTarget_sub = SampleTarget[i:i+len(Pattern)] #rolling the Pattern over windows of SampleTarget
    score.append(distance.levenshtein(Pattern, SampleTarget_sub))
    
print(score)
[7, 7, 7, 7, 6, 4, 2, 0, 2, 4, 6, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]

This tells you that at the 0th window position you need 7 edits to change Pattern into the corresponding subsequence of SampleTarget, while at the 7th position the distance between Pattern and the SampleTarget subsequence is 0, meaning 0 edits are needed, i.e. an exact match.

x2 = np.arange(start = np.argmin(score),stop= np.argmin(score)+len(Pattern))
plt.plot(x0,SampleTarget)
plt.plot(x2,Pattern)

[figure: Pattern plotted on top of SampleTarget at the minimum-distance position]

Now let's say the pattern is NOT an exact match and has some points in the middle that don't match correctly.

#modified a value in pattern
Pattern = [7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 4.098092643051778, -0.5337603416066172]
SampleTarget = [-2.2538552787663173, -3.00364077669902, 2.533625273694082, -2.2574740695546116, 3.027465667915112, 6.4222962738564, -2.647309991460278, 7.602339181286544, 3.5054347826086927, -5.198214754528746, 4.7078371642204315, -2.9357312880190425, 2.098092643051778, -0.5337603416066172, 4.212503353903944, -2.600411946446969, 8.511763150938416, -3.775883069427527, 1.8227848101265856, 3.6300348085529524, -1.4635316698656395, 5.527148770392016, -1.476695892939546, 12.248243559718961, -4.443980805341117, 1.9213973799126631, -9.061696658097686, 5.347467608951697, -2.8622540250447197, 2.6012891344383067]

Running the code again, the scores I get are:

[7, 7, 7, 7, 6, 4, 3, 1, 3, 5, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7]

This still places the pattern at the 7th position, since it's the minimum distance from the original Pattern.

[figure: same plot with the modified Pattern, still aligned at position 7]

If you have too much jitter in the sequence, I would recommend simplifying your sequences with a polygon approximation algorithm such as the Ramer–Douglas–Peucker algorithm (RDP). This will give better results when applying Levenshtein distances. There is a Python implementation of it as well!
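
For example, here is a sketch using the third-party rdp package (pip install rdp); the epsilon value below is an arbitrary choice you would tune for your data:

from rdp import rdp

# Treat the series as (x, y) points and simplify the polyline
pts = np.column_stack([np.arange(len(SampleTarget)), SampleTarget])
simplified = rdp(pts, epsilon=1.0)  # larger epsilon -> fewer points kept
print(len(pts), '->', len(simplified))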

Hope this solves your problem!

Akshay Sehgal
  • Thank you a lot! What happens if, for example, the shape of the patterns is the same but the numbers are very different? – Jack022 Jul 16 '20 at 09:13
  • You can do 2 things then. One, you could use a grid to map values onto just for applying the above method, and choose the significant digit as a hyperparameter for the approximation. So 2.00395 and 2.00376 would both round to 2.004 just for the approximate matching, and that should get you going. – Akshay Sehgal Jul 16 '20 at 10:45
  • Second, you can write your own Levenshtein distance function from scratch and modify it to weight the number of edits by the difference between the 2 numbers... so if I need 2 edits, I take a weighted sum of the differences between the 2 points in question, which gives me a continuous number instead of a count of edits. – Akshay Sehgal Jul 16 '20 at 10:45
  • I would personally recommend the first, since that lets you control your approximations. There might be some unique edge cases that may give weird results, since information is lost in differences and multiplications. – Akshay Sehgal Jul 16 '20 at 10:48