I'm trying to figure out a way to detect recurrence of some patterns inside a stream of numbers. Problem is - the pattern elements are not necessarily adjacent.
For example, take the following stream (each letter here denotes a unique number):
A C B D A E B F G A H B I A J B ...
The obvious pattern would be A _ B
. Naturally A
and B
themselves are also patterns, so let's say we want all the patterns above some threshold length - 2 in this case, or all the patterns that are not part of a longer pattern - either definition works for me. Note also that it's not recurring at any fixed rate (i.e. - 2 instances of the pattern can be spaced out between themselves by any number of elements). You might also notice the 'B _ A' and 'A _ B _ A _ B' are also recurring patterns here - these would also be welcome as output.
By the way, this is simplifying things, in reality I can have A and B also appearing outside that pattern, and perhaps in the future i'd like to recognize even A _ _ B
(or distances that are within some threshold), but for now i'd settle for the simpler problem stated above.
One more restriction - the possible values are not bounded, sorry about that. This really cripples most of the solutions I could think of. On the bright side however, I'm a little lax on complexity, i'd settle for pretty much anything that's not offensively-exponential (although my poor computer would be much obliged for an efficient solution :))
Had two ideas to approach this -
Keep a "matrix" of values, populate with distances, recognize recurring delta and mark them. Since the region is not bounded, this matrix would have to be a tile of the entire value space (going from lowest to highest seen value). For the sake of space optimization i'm thinking perhaps on multiple matrices, one for each "group" of values, and treat the cross-matrix deltas in another way.
Some kind of convolution/FFT-like approach - match a section of the stream against itself with changing offsets, see how many match in one go. I don't really see how we can fit this here though.
If someone has a better way to approach this, knows an existing algorithm, suspects that the problem in not solvable, or can help me choose between the two options suggested (or find problems with them), i'd be very much obliged. If something is not clear about the constraints or the examples - leave a comment and i'll add it.
Thanks in advance!
EDIT: Adding an option of suffix-tree based solution. Just found out about Ukkonen's algorithm (Ukkonen's suffix tree algorithm in plain English?), trying to decide if it's useful here...