2

Is there a way to calculate minimal number of operations (insertions, deletions, substitutions) in order to convert some string to match some regular expression?

For example, minimal number of operations to convert string baba to match regex (ab)+ is 2: it needs to be turned either into ababab (+2 characters) or ab (-2 characters).

Right leg
  • 16,080
  • 7
  • 48
  • 81
Curious
  • 154
  • 1
  • 8
  • 3
    What would be the Levenstein difference between aaaaaaaa and the regex a{0,100} ? – DhruvPathak Sep 11 '17 at 12:13
  • I suppose it would be 0. – Curious Sep 11 '17 at 12:27
  • @DhruvPathak Granted it's not a good indicator *by itself*. But comparing the Levenshtein distance of two string to the regex they match could be interseting. – Right leg Sep 11 '17 at 12:27
  • @Curious More like `7` actually - you need to replace 7 characters in `aaaaaaaa` to get `a{0,100}`. The Leveinshtein distance is not related to regexes: it measures the distance between two raw strings. – Right leg Sep 11 '17 at 12:28
  • @Rightleg I think that regex _a{0,100}_ means that letter a occurs up to a hundred times in a string. Since string _aaaaaaaa_ completely matches this regular expression it's Levenshtein distance would be 0. – Curious Sep 11 '17 at 12:33
  • @Curious But unfortunately, that's not what Levenshtein distance acutally measures. It only compares the similarity of **two raw strings**. – Right leg Sep 11 '17 at 12:34
  • In fact, I want to calculate the minimum number of operations needed to convert a string to match particular regular expression. – Curious Sep 11 '17 at 12:35
  • @Rightleg Yes, I know that. But, I want to know is this task possible to solve and how. I may have clumsily titled the question. – Curious Sep 11 '17 at 12:37
  • Please let me know if you manage to achieve anything with the vague idea I proposed. I'd be glad to see if one can make something out of it. – Right leg Sep 11 '17 at 14:11

1 Answers1

4

You could compute the Levenshtein distance between a string and a regular expression, but it wouldn't make much sense, because this measures the similarity of two raw strings.

What you want is probably to measure the number of operations that the string would need to undergo in order to match the pattern.

Here is a solution, based on the graph theory.

First, we'll need to build the automaton that represents the language defined by the regular expression. Thompson's algorithm will help you here.

Once we've constructed this automaton, we can try our best to make the expression match, and modify it whenever it's going off. The number of modifications will be the distance from the string to the language described by the regex.

Here is an example, for a basic regex: .*a.*. The corresponding automaton is:

enter image description here

Let's check the distance of the bbb string.

  1. The first character is b, so we stay in 1
  2. b again, so we stay in 1
  3. Still b, we stay in 1
  4. The string is empty, and we're still not in the final state 2. So let's add an a. Now, we can go to 2.
  5. We're in 2, and no character is left in the string, so we're good.

We made one modification to the original string, so the distance from bbb to the .*a.* regex is 1. In fact, the distance of any word to that language is either 0 or 1, because it is equivalent to that word containing an a, which is binary.

This is only an idea, and it might have flaws, but I think that you can do something with that.

Right leg
  • 16,080
  • 7
  • 48
  • 81