0

Possible Duplicate:
Search for string allowing for one mismatch in any location of the string

I am given a string s and a string t. Is there a regular expression to find all occurrences of t inside s with at most one mismatched character. (At most one character from t is allowed to be substituted by another.)

Community
  • 1
  • 1
Randomblue
  • 112,777
  • 145
  • 353
  • 547
  • Please add some example data to play with, it makes things easier – Robjong Feb 21 '12 at 01:24
  • @Robjong: [This](http://stackoverflow.com/questions/2420412/search-for-string-allowing-for-one-mismatch-in-any-location-of-the-string) is what I'm trying to do. – Randomblue Feb 21 '12 at 01:27
  • i'm no expert on this subject, but you may be able to use a variant of this that allows up to 1 mismatch: http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm – mpen Feb 21 '12 at 01:46
  • @Randomblue - if you can cite another SO question that covers this exact same topic, shouldn't this be closed as a duplicate? I posted a couple of possible solutions in the other question, including a pyparsing solution that handles not just single mismatch, but also n-mismatch. Did you try any of those? – PaulMcG Feb 21 '12 at 02:03
  • @PaulMcGuire: Yeah, it should. I only noticed after posting... – Randomblue Feb 21 '12 at 02:06

2 Answers2

1

Yes, absolutely. For example, if t is "abcde", then one such regex is

.bcde|a.cde|ab.de|abc.e|abcd.

That said, this is almost certainly not the best or most efficient way to do this, especially if t is at all big. (If it is big, then you can improve its performance somewhat by reformulating it as

.bcde|a(?:.cde|b(?:.de|c(?:.e|d.)))

or perhaps as

a(?:b(?:c(?:d.|.e)|.de)|.cde)|.bcde

but it's still not the best approach.)

ruakh
  • 175,680
  • 26
  • 273
  • 307
1

I wouldn't necessarily do this with regex. You can use Levenshtein distance.

>>> import Levenshtein
>>> s = "spam ham and eggs"
>>> t = "ram"
>>> for i,_ in enumerate(s): 
...   s_ = s[i:i+len(t)]
...   if Levenshtein.distance(s_, t) == 1:
...     print s_
... 
pam
ham
wim
  • 338,267
  • 99
  • 616
  • 750
  • 1
    sounds suboptimal. Levenshtein will keep counting even after 1 error – mpen Feb 21 '12 at 01:44
  • yeah it will, but Levenshtein distance is pretty fast and it may not be performance critical code anyway. premature optimisation and all that. – wim Feb 21 '12 at 01:51
  • except that in his comments he said he was trying to do this: http://stackoverflow.com/questions/2420412/search-for-string-allowing-for-one-mismatch-in-any-location-of-the-string which is a rather large data set. i don't think this is premature. – mpen Feb 21 '12 at 02:26