1

How can I detect a typo, but only for the specific phrase. Another way to think about it would be how can I detect a typo for a certain regex.

For example, I do not want a generic typo finder, I found multiple resources on that. I do not want a generic spell checker, again I found multiple resources on that.

How would I write a typo checker for a relatively constant value...say:

Super Secret 13-12345

It should always say "Super Secret NN-NNNNN" (N means any 0-9 number).

It would flag the following as "typos":

  1. Ssuper Secret 13-12345
  2. Super Secret 1312345
  3. Sper Scret 13-123456
  4. Spuer Secret 13-12345
  5. Super Secret
  6. 13-12345

It would NOT flag the following as "typos":

  1. Super Secret 13-12345
  2. Any other random words
  3. Superman flies over the jungle

I am most worried about extra characters leaking in, transposing characters, or numbers not following the NN-NNNNN format.

I feel like this is an answerable question, but I may just not be asking Google or SO using the correct words.

I am writing it in .NET, but could obviously port anything.

bulltorious
  • 7,769
  • 4
  • 49
  • 78
  • 1
    Have a look at this answer: http://stackoverflow.com/questions/10074298/how-to-check-if-two-strings-are-approximately-equal – Fabian Bigler Sep 09 '13 at 14:07
  • Or take a look at 'Levenshtein distance': http://en.wikipedia.org/wiki/Levenshtein_distance – Fabian Bigler Sep 09 '13 at 14:11
  • I don't think this can be done with a regular expression, because regex find patterns but the mistakes that might be committed don't follow any pattern, they are random. – Ibrahim Najjar Sep 09 '13 at 14:12

2 Answers2

1

This isn't a good place for a regex: you would need a regex that detects every possible type of typo. Instead, you should be looking at the Levenshtein distance. It would work something like:

  1. replace all invalid characters with a placeholder, e.g. "!".
  2. replace all numbers with a different placeholder, e.g. "#".
  3. Calculate Levenshtein distance from "Super Secret ##-#####".
  4. If distance is below a certain value, and isn't 0, return true. Otherwise, return false.

Once you have it implemented, play with the threshold in step 4 to match the desired behaviour.

Edit: "Invalid character" can either mean any character other than those in "Superct0123456789-", or it can mean any non-alphanumeric other than "-". The end result should be the same.

RoadieRich
  • 6,330
  • 3
  • 35
  • 52
0

Why not search for your number pattern, and have a lookaround for Super Secret? If it isn't there, you can capture whatever is there and see if it is the spelling error you are looking for? Then you can add a simple replace - or re.sub() - to put the correct spelling in? Now you have to be careful, and build the regex up slowly. There's this thing about lookarounds that aren't fixed length, but I forget now if it's look ahead or look behind that have that issue. There are workarounds if you run into this issue. Make separate capture groups for your number - strict and specific - and another for your phrase - much more flexible, like with ? as a quantifier and character sets [sS] for known possible misspellings.

Malik A. Rumi
  • 1,855
  • 4
  • 25
  • 36