Regex to find 'good enough' sequences

Question

I'm looking to implement some algorithm to help me match imperfect sequences.

Say I have a stored sequence of ABBABABBA and I want to find something that 'looks like' that in a large stream of characters.

If I give my algorithm the allowance to have 2 wildcards (differences), how can I use Regex to match something like: where ( and ) mark the differences:

A(A)BABAB(A)A 
or 
(B)BBA(A)ABBA

My Dilemma is that I am looking to find these potential target matches (with imperfections) in a big string of characters. So in something like:

ABBDBABDBCBDBABDB(A(A)BABAB(A)A)DBDBABDBCBDBAB
ADBDBABDBDBDBCBDBABCBDBABCBDBABCBDBABABBBDBABABBCD
DBABCBDABDBABCBCBDBABABDABDBABCBDBABABDDABCBDBABAB

I must be able to search for these 'near enough' matches.
Where brackets denote: (The Good enough Match with the (Differences))

Edit: To be more formal in this example, A match of Length N can be accepted if N-2 characters are the same as the original (2 Differences)

I've used Regex before, but only to find perfect sequences - not for something that 'looks like' one.

Hope this is clear enough to get some advice on. Thanks for reading and any help!

I think the first question to ask is "What qualifies a match?" Does it have to be a certain length, contain specific characters, contain specific sequences within? — 4444, Jun 27 '13 at 13:11
@Doc In the case of getting these matches in this example - a match is something that mostly resembles the original - BUT has allowance for 2 incorrect characters (wildcards). — Rodney, Jun 27 '13 at 13:14
Could you please be more formal and precise? One example of more formal definition would be: a target sequence of size N matches if at least N-2 characters match. Another formal definition would be: a target sequence of the size from N-2 to N+2 matches if it can be reduced to pattern sequence with no more than 2 editing operations (add/delete/replace). — , Jun 27 '13 at 13:19
@Arkadiy hopefully the Edit is more clear. Basically the target must be more or less the same as the original (N), with the allowance of 2 differences (N-2) (wildcards). — Rodney, Jun 27 '13 at 13:29
@Doc This is just an extremely simple example, in reality I am using 4 unique characters, for the sake of this (ABCD). I currently record a signature, ABCDBD. And then I search character samples where this looks like it could be (with allowances (wildcards) for differences), so AB(B)DBD is a match if 1 wildcard is allowed, as would ABCDB(B). With 2 Wildcards, A(A)(B)DBD etc. — Rodney, Jun 27 '13 at 13:35
@Rodney, Excellent: I think my example should work, regardless of how many individual characters there can be. You'll just need to specify how many Wildcards are allowed when you check the Differences count. — 4444, Jun 27 '13 at 13:43
Sorry for this, I've updated the original post to show the extent of the problem I'm facing, thanks for the help guys! — Rodney, Jun 27 '13 at 13:58

Bob Vale · Accepted Answer · 2013-06-27T15:24:09.707

You could use LINQ to be nice and expressive.

In order to use this make sure you have a using System.Linq at the top of your code.

Assuming that

source is the stored target pattern
test is the string to test.

Then you can do

public static bool IsValid(string source, string test) 
{
  return test != null  
         && source != null 
         && test.Length == source.Length 
         && test.Where((x,i) => source[i] != x).Count() <=2
}

There is also a shortcut version that exits false the moment it fails, saving iterating the rest of the string.

public static bool IsValid(string source, string test) 
{
   return test != null  
          && source != null 
          && test.Length == source.Length 
          && !test.Where((x,i) => source[i] != x).Skip(2).Any();
}

As requested in comments, a little explanation of how this works

in C# a string can be treated as an array of characters, which means that the Linq methods can be used on it.

test.Where((x,i) => source[i] != x)

This uses the overload of Where that for each character in test, x gets assigned to the character and i gets assigned to the index. If the condition character at position i in source is not equal to x then output into the result.

Skip(2)

this skips the first 2 results.

Any()

this returns true if there any results left or false if not. Because linq defers execution the moment that this is false the function exits rather than evaluating the rest of the string.

The entire test is then negated by prefixing with a '!' to indicate we want to know where there are no more results.

Now in order to match as substring you are going to need to behave similar to a regex backtracking...

public static IEnumerable<int> GetMatches(string source, string test)
{
   return from i in Enumerable.Range(0,test.Length - source.Length)
      where IsValid(source, !test.Skip(i).Take(source.Length))
          select i;
}

public static bool IsValid(string source, IEnumerable<char> test) 
{
   return test.Where((x,i) => source[i] != x).Skip(2).Any();
}

UPDATE Explained

Enumerable.Range(0,test.Length - source.Length)

This creates a sequence of numbers from 0 to test.Length - source.Length, there is no need in checking starting at every char in test because once the length is shorter the answer is invalid.

from i in ....

Basically iterate over the collection assigning i to be the current value each time

where IsValid(source, !test.Skip(i).Take(source.Length))

Filter the results to only include the ones where there is a match in test starting at index i (hence the skip) and going on for source.Length chars (hence the take.

select i

return i

This returns an enumerable over the indexes in test where there is a match, you could extract them with

GetMatches(source,test).Select(i => 
                      new string(test.Skip(i).Take(source.Length).ToArray()));

Thanks for the answer - I'm not that clued up with LINQ unfortunately although I've seen many fine uses of it. Could you elaborate on what this means a little please? it resembles a neater version of an idea I've just mocked up on a white board — Rodney, Jun 27 '13 at 15:06

score 2 · Answer 2 · answered Jun 27 '13 at 13:18

I don't think this can be done with regexes (if it can, I'm unfamiliar with the syntax). However, you can use the dynamic programming algorithm for Levenshtein distance.

Edit: If you don't need to handle letters that have switched positions, a much easier approach is to just compare each pair of characters from the two strings, and just count the number of differences.

score 2 · Answer 3 · answered Jun 27 '13 at 13:40

2

I can't think how you'd do it with regex but it should be pretty simple to code.

I'd probably just split the strings up and compare them character by character. If you get a difference count it and move to the next character. If you exceed 2 differences then move on to the next full string.

answered Jun 27 '13 at 13:40

ydaetskcoR

53,225
8
158
177

brings up a good point, and I failed to mention this in my post: If the strings are very long and time is of the essence, you can write code within the loop that will quit checking after Differences reaches 2. – 4444 Jun 27 '13 at 13:53

4444 · Answer 4 · 2016-06-27T16:25:29.140

I don't think there's a good regular expression to handle this case. (Or at least, there isn't one that won't take up a good three lines of text and cause multiple bullets in your feet.) However, that doesn't mean you can't solve this problem.

Depending on how large your strings are (I'm assuming they won't be millions of characters each) I don't see anything stopping you from using a single loop to compare individuals character in order, while keeping a tally of differences:

int differences = 0;    // Count of discrepancies you've detected
int tolerance = 7;    // Limit of discrepancies you'll allow

CheckStrings(int differences, int tolerance) {
    for (i = 0; i < StringA.Length; i++)
    {
        if (StringA[i] != StringB[i]) {
            differences++;
            if (differences > tolerance) {
                return false;
            }
        }
    }
    return true;
}

Most of the time, don't be concerned about your strings being too long to put into a loop. Behind-the-scenes, any code that assesses every character of a string will loop in some form or another. Until you literally have millions of characters to deal with, a loop should do the trick just fine.

Yeah basically this. I was 4 seconds quicker but you brought code. I'll delete my answer. — ydaetskcoR, Jun 27 '13 at 13:42

score 1 · Answer 5 · edited May 23 '17 at 12:01

I'll bypass the 'regex' part and focus on:

Is there a better way than doing nested loops to wildcard every position?

It sounds like there's a programmatic way that might help you. See this post about iterating over two IEnumerables. By iterating over both strings at the same time, you can complete the task in O(n) time. Even better, if you know your tolerance(maximum of 2 errors), you can sometimes finish faster than O(n).

Here's a simple example that I wrote up. It probably needs tweaking for your own case, but it might be a good starting point.

static void imperfectMatch(String original, String testCase, int tolerance)
{
    int mistakes = 0;

    if (original.Length == testCase.Length)
    {
        using (CharEnumerator enumerator1 = original.GetEnumerator())
        using (CharEnumerator enumerator2 = testCase.GetEnumerator())
        {
            while (enumerator1.MoveNext() && enumerator2.MoveNext())
            {
                if (mistakes >= tolerance)
                    break;
                if (enumerator1.Current != enumerator2.Current)
                    mistakes++;
            }
        }
    }
    else
        mistakes = -1;

    Console.WriteLine(String.Format("Original String: {0}", original));
    Console.WriteLine(String.Format("Test Case String: {0}", testCase));
    Console.WriteLine(String.Format("Number of errors: {0}", mistakes));
    Console.WriteLine();
}

Thanks, this could actually go someway to helping part of the problem. I've elaborated a little since. — Rodney, Jun 27 '13 at 14:00
@Rodney One thing I can think of is to make an array of Boolean to denote which characters are allowed to be different. Since that array would be equal to the length of both the original and testCase strings, you could iterate over that at the same time as well. In fact, you can iterate over as many equal length arrays/collections as you want. Just make sure to increment each one in the loop structure. — Shaz, Jun 27 '13 at 14:18

score 0 · Answer 6 · answered Jun 27 '13 at 13:18

0

Does any combination of A, B, ( and ) work?

bool isMatch = Regex.IsMatch(inputString, "^[AB()]+$")

answered Jun 27 '13 at 13:18

ic3b3rg

14,629
4
30
53

score 0 · Answer 7 · answered Jun 27 '13 at 13:51

0

For sufficiently small patterns (ABCD), you could generate a regexp:

..CD|.B.D|.BC.|A..D|A.C.|AB..

You could also code a custom comparison loop

answered Jun 27 '13 at 13:51

Regex to find 'good enough' sequences

7 Answers7