How to detect if a regex is replacable with a static text search?

Question

How to detect if a certain regex string is "simple", that is it could be replaced with a simple string (and by that avoid using regex at all).

For example:

input regex        simple text form (if possible)
--------------------------------------------------
foo\.bar     --->  foo.bar
foo          --->  foo
ba\[12\]r    --->  ba[12]r
ba.*foo      --->  (not possible to represent as plain string)

Basically I'm looking for the opposite of the mythical RegExp.escape described in this answer, a RegExp.unescape, that would either do the opposite of mentioned RegExp.escape or report in some way that the conversion is not possible.

Looking for a JavaScript solution, but Java is also acceptable.

Something like `if (s.indexOf(s.replace(/\\([^])/g, '$1')) > -1) { console.log("Yes, it is 'simple'"); }`? That will sometimes result in false positives though. Like with ``abc \wat`` regex and `abc wat` string. — Wiktor Stribiżew, Jul 15 '19 at 18:56
@WiktorStribiżew can you explain what you've done there? I'm struggling to follow. — Robert Bain, Jul 15 '19 at 19:04
Well, that is your logic: remove escaping and check if a string contains the new string. I think it won't work in 100% cases. — Wiktor Stribiżew, Jul 15 '19 at 19:06
This seems straightforward to write… until you realize you not only have to search for metacharacters, but you also have to write a parser for `\c`, `\u`, and `\0`. Certainly doable, but not a trivial effort. — VGR, Jul 16 '19 at 03:07
How important is it to your that you catch all cases? @AmigoJack points out challenges in his answer, but are you looking for something that just works most of the time, or do you need it to be all of the time? I'm assuming false positives would be bad, but with false negatives you'd only lose some optimization potential, which could be acceptable, right? — joanis, Jul 16 '19 at 11:47

score 2 · Answer 1 · answered Jul 18 '19 at 08:33

Basically I'm looking for the opposite of the mythical RegExp.escape described in this answer, a RegExp.unescape, that would either do the opposite of mentioned RegExp.escape or report in some way that the conversion is not possible.

This function should handle most cases. If the conversion is not possible, it returns undefined. What it doesn't (but could be extended to) do is e. g. convert a{3} to aaa or [a] to a.

function cape(inre)
{
  if (/[iy]/.test(inre.flags)) return // these flags are not "simple"
  instr = inre.source
  outstr = ''
  special = '\\^$*+?.()|{}[]'
  // these so-called non-special characters after \ are not "simple":
  non_special = 'bBdDsSwW123456789'
  function pint(base, size)
  { // helper function for \0, \xhh and \uhhhh
    for (n = l = 0; l < size && !isNaN(d = parseInt(instr[i+1], base)); ++i, ++l)
      n = n*base+d
    return String.fromCharCode(n)
  }
  for (i = 0; c = instr[i]; outstr += c, ++i)
  { // convert input sequence to output text if possible
    if (c == '\\')
    {
      if (0 <= special.indexOf(c = instr[++i])) ; else
      if (0 <= non_special.indexOf(c)) return; else
      if (c == 'c') c = String.fromCharCode(instr.charCodeAt(++i)&31); else
      if (c == 'f') c = '\f'; else
      if (c == 'n') c = '\n'; else
      if (c == 'r') c = '\r'; else
      if (c == 't') c = '\t'; else
      if (c == 'v') c = '\v'; else
      if (c == '0') c = pint(8, 3); else
      if (c == 'x') c = pint(16, 2); else
      if (c == 'u') c = pint(16, 4)
    }
    else
      if (0 <= special.indexOf(c)) return
  }
  return outstr
}
console.log('input regex        simple text form (if possible)')
console.log('--------------------------------------------------')
testcases = [/foo\.bar/, /foo/, /ba\[12\]r/, /ba.*foo/]
for (i in testcases)
  s = testcases[i],
  console.log(s.source, ' '.repeat(11-s.source.length), '---> ', cape(s))

Why should it break? It yields \\ ---> \, i. e. the correct string of one backslash. — Armali, Jul 19 '19 at 08:47
Because `instr[++i]` should be outside the string when \\ is the last character — AmigoJack, Jul 19 '19 at 09:17
I'm speaking of one character, not two. A string consisting of one backslash characters, properly escaped as \\ — AmigoJack, Jul 19 '19 at 13:25
The function receives not a _string_, but a `RegExp` argument, so talking about a string is meaningless in that context. — Armali, Jul 19 '19 at 13:32
Alright, I had the wrong understanding that `++i` twice in the chained IFs could have left the "source" string's length already and yet fail to grasp why it doesn't. — AmigoJack, Jul 20 '19 at 14:46
Well, finally it _could have left the "source" string's length_, but then _`[]` returns `undefined`_, which is a condition for leaving the loop. — Armali, Jul 22 '19 at 06:19

score 1 · Answer 2 · answered Jul 16 '19 at 07:42

Unescaping would not be enough, it comes down at parsing the whole regex. Almost any non-word character has a meaning, hence it can't be simply detected. Only looking for backslashes is by far not enough - consider these cases:

Once[.]{1,1} has no escaping, yet it is obvious it could be a string
\x31\x{0032} is also obvious
fo{2}d is even more obvious, but recognizing this thru code is not trivial, tho

You'd first have to optimize a given regex to remove all redundancy and substitute the wanted characters, to then see what's left. Which is the point of parsing it already.

Those examples can be left undetected, as far as I am concerned. — David Balažic, Jul 16 '19 at 12:45

How to detect if a regex is replacable with a static text search?

2 Answers2