3

I want to ellaborate a regex that covers the following scenarios:

The searched word is "potato".

It matches if the user searches for "potaot" (He typed quickly and the "o" finger was faster than the "t" finger. (done)

It matches if the user searches for "ptato" (He forgot one letter). (done)

With my knowlege of regex the further I could go was:

 (?=[potato]{5,6})p?o?t?a?t?o?

The problem with this is that it matches reversed words like "otatop", which is a little clever but a little bezarre, and "ooooo", which is totally undesirable. So not I describe what I don't want.

I don't want repeated letters to match "ooooo", "ooopp" and etc. (unable)

By the way I'm using C#.

Diego Alves
  • 2,462
  • 3
  • 32
  • 65
  • Why not string compare with a threshold on minimum "percentage match" required!! – Prateek Shrivastava Mar 20 '18 at 02:16
  • 5
    I suggest you look at string distance algorithms, rather than RegEx for this use case (sometimes called "fuzzy matching"). For example, Jaro-Winkler. Basically, you get a value that measures how similar two strings are and you can determine an appropriate threshold for your particular use case and data. – Jay Mar 20 '18 at 02:17
  • This might help https://stackoverflow.com/questions/4833769/creating-a-spell-check-that-checks-against-a-database-with-a-reasonable-runtim – Matthew Mar 20 '18 at 02:18
  • I suspect regexes are a *very* bad tool for this (and I use them to parse html, so I'm not picky). Maybe someone will prove me wrong. – zzxyz Mar 20 '18 at 02:23
  • Never heard any of those. I may check them out. I am thinking about performing some manipulation on the input string before testing it, remove 3 or more repeated letter and etc. – Diego Alves Mar 20 '18 at 02:23
  • Regexes aren't designed to do what you're attempting. Pick the right tool for the job. (And you've learned a lesson, BTW; with regexes. What it should match is not the only thing that is important; what it should not match matters just as much.) – Ken White Mar 20 '18 at 02:26
  • Not sure if [`\b(?=\w{5,6}\b)p([ot]?)(?!\1)[ot]?([at]?)(?!\2)[at]?([ot]?)(?!\3)[ot]?\b`](https://regex101.com/r/4fSQ3L/1) is good enough. – Wiktor Stribiżew Mar 20 '18 at 08:13
  • I think you should look here http://www.anotherchris.net/csharp/how-to-write-a-spelling-corrector-in-csharp/ – Praneet Nadkar Mar 22 '18 at 06:34
  • This can be a very bad idea. I am just curious. What if we convert all the possible combinations like ptato, potato, potat to their respective ascii values like a number.. and then match the user's entered value to the set of numbers ? – Praneet Nadkar Mar 22 '18 at 06:36
  • 5
    This is an "XY" problem. You have a bad solution in mind and you're asking a question about how to implement your bad solution. Regular expressions are not a good tool for this job, but you've asked for a solution that uses this wrong tool. Don't do that. State the problem you really have, and not what tool you think might solve it. – Eric Lippert Mar 22 '18 at 18:23

3 Answers3

7

Don't use regular expressions.

The best solution is the simple one. There are only eleven possible inexact matches, so just enumerate them:

List<string> inexactMatches = new List<string> {
  "otato", "ptato", "poato", "potto", "potao", "potat",
  "optato", "ptoato", "poatto", "pottao", "potaot"};
...
bool hasInexactMatch = inexactMatches.Contains(whatever);

It took less than a minute to type those out; use the easy specific solution rather than trying to do some crazy regular expression that's going to take you hours to find and debug.

If you're going to insist on using a regular expression, here's one that works:

otato|ptato|poato|potto|potao|potat|optato|ptoato|poatto|pottao|potaot

Again: simpler is better.


Now, one might suppose that you want to solve this problem for more words than "potato". In that case, you might have said so -- but regardless, we can come up with some easy solutions.

First, let's enumerate all the strings that have an omission of one letter from a target string. Strings are IEnumerable<char> so let's solve the general problem:

static IEnumerable<T> OmitAt<T>(this IEnumerable<T> items, int i) =>
  items.Take(i).Concat(items.Skip(i + 1));

That's a bit gross enumerating the sequence twice but I'm not going to stress about it. Now let's make a specific version for strings:

static IEnumerable<string> Omits(this string s) =>
  Enumerable
    .Range(0, s.Length)
    .Select(i => new string(s.OmitAt(i).ToArray()));

Great. Now we can say "frob".Omits() and get back rob, fob, frb, fro.

Now let's do the swaps. Again, solve the general problem first:

static void Swap<T>(ref T x, ref T y)
{
    T t = x;
    x = y;
    y = t;
}

static T[] SwapAt<T>(this IEnumerable<T> items, int i)
{
    T[] newItems = items.ToArray();
    Swap(ref newItems[i], ref newItems[i + 1]);
    return newItems;
}

And now we can solve it for strings:

static IEnumerable<string> Swaps(this string s) =>
    Enumerable
      .Range(0, s.Length - 1)
      .Select(i => new string(s.SwapAt(i)));

And now we're done:

string source = "potato";
string target = whatever;
bool match = 
  source.Swaps().Contains(target) || 
  source.Omits().Contains(target);

Easy peasy. Solve general problems using simple, straightforward, correct algorithms that can be composed into larger solutions. None of my algorithms there was more than three lines long and they can easily be seen to be correct.

Eric Lippert
  • 647,829
  • 179
  • 1,238
  • 2,067
2

The weapon of choice here is a similarity (or distance) matching algorithm. Compare similarity algorithms gives a good overview of the most common distance metrics/algorithms.

The problem is that there is no single best metric. The choice depends, e.g. on input type, accuracy requirements, speed, resources availability, etc. Nevertheless, comparing algorithms can be messy.

Two of the most commonly suggested metrics are the Levenshtein distance and Jaro-Winkler:

  • Levenshtein distance, which provides a similarity measure between two strings, is arguably less forgiving, and more intuitive to understand than some other metrics. (There are modified versions of the Levenshtein distance like the Damerau-Levenshtein distance, which includes transpositions, that could be even more appropriate to your use case.)

  • Some claim the Jaro-Winkler, which provides a similarity measure between two strings allowing for character transpositions to a degree adjusting the weighting for common prefixes, distance is "one of the most performant and accurate approximate string matching algorithms currently available [Cohen, et al.], [Winkler]." However, the choice still depends very much on the use case and one cannot draw general conclusions from specific studies, e.g. name-matching Cohen, et al. 2003.

You can find plenty of packages on NuGet that offer you a variety of similarity algorithms (a, b, c), fuzzy matches, phonetic, etc. to add this feature to your site or app.

Fuzzy matching can also be used directly on the database layer. An implementation of the Levenshtein distance can be found for most database systems (e.g. MySQL, SQL Server) or is already built-in (Oracle, PostgreSQL).

Depending on your exact use case, you could also use a cloud-based solution (i.e. use a microservice based on AWS, Azure, etc. or roll-your-own) to get autosuggest-like fuzzy search/matching.

wp78de
  • 18,207
  • 7
  • 43
  • 71
0

It's easiest to do it this way:

static void Main(string[] args)
{
    string correctWord = "Potato";

    string incorrectSwappedWord = "potaot";
    string incorrectOneLetter = "ptato";

    // Returns true
    bool swapped = SwappedLettersMatch(correctWord, incorrectSwappedWord);
    // Returns true
    bool oneLetter = OneLetterOffMatch(correctWord, incorrectOneLetter);
}

public static bool OneLetterOffMatch(string str, string input)
{
    int ndx = 0;

    str = str.ToLower();
    input = input.ToLower();

    if (string.IsNullOrWhiteSpace(str) || string.IsNullOrWhiteSpace(input))
    {
        return false;
    }

    while (ndx < str.Length)
    {
        string newStr = str.Remove(ndx, 1);

        if (input == newStr)
        {
            return true;
        }

        ndx++;
    }
    return false;
}

public static bool SwappedLettersMatch(string str, string input)
{
    if (string.IsNullOrWhiteSpace(str) || string.IsNullOrWhiteSpace(input))
    {
        return false;
    }

    if (str.Length != input.Length)
    {
        return false;
    }

    str = str.ToLower();
    input = input.ToLower();

    int ndx = 0;

    while (ndx < str.Length)
    {
        if (ndx == str.Length - 1)
        {
            return false;
        }
        string newStr = str[ndx + 1].ToString() + str[ndx];
        if (ndx > 0)
        {
            newStr = str.Substring(0, ndx) + newStr;
        }
        if (str.Length > ndx + 2)
        {
            newStr = newStr + str.Substring(ndx + 2);
        }

        if (newStr == input)
        {
            return true;
        }

        ndx++;
    }

    return false;
}

OneLetterOffMatch will return true is there is a match that's off by just one character missing. SwappedLettersMatch will return true is there is a match when just two of the letters are swapped. These functions are case-insenstive, but if you want a case-sensitive version, just remove the calls to .ToLower().

Icemanind
  • 47,519
  • 50
  • 171
  • 296