-1

How to detect if a certain regex string is "simple", that is it could be replaced with a simple string (and by that avoid using regex at all).

For example:

input regex        simple text form (if possible)
--------------------------------------------------
foo\.bar     --->  foo.bar
foo          --->  foo
ba\[12\]r    --->  ba[12]r
ba.*foo      --->  (not possible to represent as plain string)

Basically I'm looking for the opposite of the mythical RegExp.escape described in this answer, a RegExp.unescape, that would either do the opposite of mentioned RegExp.escape or report in some way that the conversion is not possible.

Looking for a JavaScript solution, but Java is also acceptable.

David Balažic
  • 1,319
  • 1
  • 23
  • 50
  • 2
    Something like `if (s.indexOf(s.replace(/\\([^])/g, '$1')) > -1) { console.log("Yes, it is 'simple'"); }`? That will sometimes result in false positives though. Like with ``abc \wat`` regex and `abc wat` string. – Wiktor Stribiżew Jul 15 '19 at 18:56
  • @WiktorStribiżew can you explain what you've done there? I'm struggling to follow. – Robert Bain Jul 15 '19 at 19:04
  • Well, that is your logic: remove escaping and check if a string contains the new string. I think it won't work in 100% cases. – Wiktor Stribiżew Jul 15 '19 at 19:06
  • What's the `[^]` doing? – Robert Bain Jul 15 '19 at 19:09
  • This seems straightforward to write… until you realize you not only have to search for metacharacters, but you also have to write a parser for `\c`, `\u`, and `\0`. Certainly doable, but not a trivial effort. – VGR Jul 16 '19 at 03:07
  • How important is it to your that you catch all cases? @AmigoJack points out challenges in his answer, but are you looking for something that just works most of the time, or do you need it to be all of the time? I'm assuming false positives would be bad, but with false negatives you'd only lose some optimization potential, which could be acceptable, right? – joanis Jul 16 '19 at 11:47

2 Answers2

2

Basically I'm looking for the opposite of the mythical RegExp.escape described in this answer, a RegExp.unescape, that would either do the opposite of mentioned RegExp.escape or report in some way that the conversion is not possible.

This function should handle most cases. If the conversion is not possible, it returns undefined. What it doesn't (but could be extended to) do is e. g. convert a{3} to aaa or [a] to a.

function cape(inre)
{
  if (/[iy]/.test(inre.flags)) return // these flags are not "simple"
  instr = inre.source
  outstr = ''
  special = '\\^$*+?.()|{}[]'
  // these so-called non-special characters after \ are not "simple":
  non_special = 'bBdDsSwW123456789'
  function pint(base, size)
  { // helper function for \0, \xhh and \uhhhh
    for (n = l = 0; l < size && !isNaN(d = parseInt(instr[i+1], base)); ++i, ++l)
      n = n*base+d
    return String.fromCharCode(n)
  }
  for (i = 0; c = instr[i]; outstr += c, ++i)
  { // convert input sequence to output text if possible
    if (c == '\\')
    {
      if (0 <= special.indexOf(c = instr[++i])) ; else
      if (0 <= non_special.indexOf(c)) return; else
      if (c == 'c') c = String.fromCharCode(instr.charCodeAt(++i)&31); else
      if (c == 'f') c = '\f'; else
      if (c == 'n') c = '\n'; else
      if (c == 'r') c = '\r'; else
      if (c == 't') c = '\t'; else
      if (c == 'v') c = '\v'; else
      if (c == '0') c = pint(8, 3); else
      if (c == 'x') c = pint(16, 2); else
      if (c == 'u') c = pint(16, 4)
    }
    else
      if (0 <= special.indexOf(c)) return
  }
  return outstr
}
console.log('input regex        simple text form (if possible)')
console.log('--------------------------------------------------')
testcases = [/foo\.bar/, /foo/, /ba\[12\]r/, /ba.*foo/]
for (i in testcases)
  s = testcases[i],
  console.log(s.source, ' '.repeat(11-s.source.length), '---> ', cape(s))
Armali
  • 18,255
  • 14
  • 57
  • 171
  • This should break with an input like \\ – AmigoJack Jul 19 '19 at 08:23
  • Why should it break? It yields \\ ---> \, i. e. the correct string of one backslash. – Armali Jul 19 '19 at 08:47
  • Because `instr[++i]` should be outside the string when \\ is the last character – AmigoJack Jul 19 '19 at 09:17
  • \\ in the `RegExp.source` is not one character, but two. – Armali Jul 19 '19 at 12:52
  • I'm speaking of one character, not two. A string consisting of one backslash characters, properly escaped as \\ – AmigoJack Jul 19 '19 at 13:25
  • The function receives not a _string_, but a `RegExp` argument, so talking about a string is meaningless in that context. – Armali Jul 19 '19 at 13:32
  • 1
    Alright, I had the wrong understanding that `++i` twice in the chained IFs could have left the "source" string's length already and yet fail to grasp why it doesn't. – AmigoJack Jul 20 '19 at 14:46
  • Well, finally it _could have left the "source" string's length_, but then _`[]` returns `undefined`_, which is a condition for leaving the loop. – Armali Jul 22 '19 at 06:19
1

Unescaping would not be enough, it comes down at parsing the whole regex. Almost any non-word character has a meaning, hence it can't be simply detected. Only looking for backslashes is by far not enough - consider these cases:

  • Once[.]{1,1} has no escaping, yet it is obvious it could be a string
  • \x31\x{0032} is also obvious
  • fo{2}d is even more obvious, but recognizing this thru code is not trivial, tho

You'd first have to optimize a given regex to remove all redundancy and substitute the wanted characters, to then see what's left. Which is the point of parsing it already.

AmigoJack
  • 5,234
  • 1
  • 15
  • 31