Finding a non-consecutive recurring pattern inside a stream

Question

I'm trying to figure out a way to detect recurrence of some patterns inside a stream of numbers. Problem is - the pattern elements are not necessarily adjacent.

For example, take the following stream (each letter here denotes a unique number):

A  C  B  D  A  E  B  F  G  A  H  B  I  A  J  B ...

The obvious pattern would be A _ B. Naturally A and B themselves are also patterns, so let's say we want all the patterns above some threshold length - 2 in this case, or all the patterns that are not part of a longer pattern - either definition works for me. Note also that it's not recurring at any fixed rate (i.e. - 2 instances of the pattern can be spaced out between themselves by any number of elements). You might also notice the 'B _ A' and 'A _ B _ A _ B' are also recurring patterns here - these would also be welcome as output.

By the way, this is simplifying things, in reality I can have A and B also appearing outside that pattern, and perhaps in the future i'd like to recognize even A _ _ B (or distances that are within some threshold), but for now i'd settle for the simpler problem stated above.

One more restriction - the possible values are not bounded, sorry about that. This really cripples most of the solutions I could think of. On the bright side however, I'm a little lax on complexity, i'd settle for pretty much anything that's not offensively-exponential (although my poor computer would be much obliged for an efficient solution :))

Had two ideas to approach this -

Keep a "matrix" of values, populate with distances, recognize recurring delta and mark them. Since the region is not bounded, this matrix would have to be a tile of the entire value space (going from lowest to highest seen value). For the sake of space optimization i'm thinking perhaps on multiple matrices, one for each "group" of values, and treat the cross-matrix deltas in another way.
Some kind of convolution/FFT-like approach - match a section of the stream against itself with changing offsets, see how many match in one go. I don't really see how we can fit this here though.

If someone has a better way to approach this, knows an existing algorithm, suspects that the problem in not solvable, or can help me choose between the two options suggested (or find problems with them), i'd be very much obliged. If something is not clear about the constraints or the examples - leave a comment and i'll add it.

Thanks in advance!

EDIT: Adding an option of suffix-tree based solution. Just found out about Ukkonen's algorithm (Ukkonen's suffix tree algorithm in plain English?), trying to decide if it's useful here...

Well, I tried starting with a "Hello" but SO insists on making me rude, my apologies :) — Leeor, Sep 23 '13 at 10:22

score 0 · Answer 1 · answered Sep 23 '13 at 18:47

I think you can use Priority Queue-like structure. This is what I can think of:

While (your input stream has more elements)
    Check if it is 'A'

    if it is 'A':
        insert it in the queue

    // Now you have to look for 'B'
    if current element is NOT 'B'
        insert it in the queue

    // As soon as you get 'B'
    you can empty the queue and you have the pattern

    // If you find 'A' while looking for 'B'
    whatever you have in queue, just discard it, because in a pattern like A..A...B, A..A is not of your interst I believe

You can add additional constraints like minimum length. You can also use LinkedList for this.

Thanks for the answer, but let me clarify - i'm looking for any pattern not just A...B. Also, these are actually numbers, i've used letters to simplify the explanation but as I mentioned - they're unbounded, so there could be any value at any point. — Leeor, Sep 23 '13 at 19:11

Finding a non-consecutive recurring pattern inside a stream

1 Answers1