3

I'm working on a JavaScript-based project that involves a rudimentary Bash-inspired scripting system, and I'm using a regex to separate lines into (a number of types of) tokens.

One such token class is of course the recursive $() construct. This construct can be nested arbitrarily. I am trying to devise a JavaScript regular expression to match this type of token without accidentally leaving parts behind or grabbing parts of other tokens.

The problem, more specifically:

Given a string such as this example:
"$(foo $(bar) fizz $(buzz)) $(something $(else))"
which is made up of individual tokens, each delimited by an outer $() and followed either by whitespace or end-of-string, match the first such token in the string from and including its open $( to and including its final closing ) In the event that an unbalanced construct occurs anywhere in the string, the behavior of this regex is considered undefined and does not matter.

So in the above example the regex should match "$(foo $(bar) fizz $(buzz))"

Further usage details:

Both the input string and the returned match are passed through String.prototype.trim(), so leading and trailing whitespace doesn't matter

I am able to deem unbalanced constructs an undefined case because the code that consumes this type of token once extracted does its own balance checking. Even if the regex returns a match that is surrounded by an outer $(), the error will eventually be caught elsewhere.

What I've tried so-far

For a while I've been using this regex:
/\$\(.*?\)(?!(?:(?!\$\().)*\))(?:\s+|$)/
which seemed to work for quite some time. It matches arbitrarily nested balanced constructs so long as they do not have multiple nests at the same level. That case just slipped past my mind somehow when testing initially. It consumes the contents of the token with lazy repetition, and asserts that after the closing paren there is not another close paren before there is an open $(. This is of course broken by tokens such as the example above.

I know that the traditional "balanced constructs regex problem" is not solvable without subroutines/recursion. However I'm hoping that since I only need to match balanced constructs and not fail to match imbalanced ones, that there's some clever way to cheat here.

Stephen Cohen
  • 370
  • 1
  • 2
  • 12

1 Answers1

2

So in the above example the regex should match $(foo $(bar) fizz $(buzz))

The solution as I see it is almost the same as I posted today (based on Matching Nested Constructs in JavaScript, Part 2 by Steven Levithan), but all you need is to add the delimiters since they are known.

Example usage:

matchRecursiveRegExp("$(foo $(bar) fizz $(buzz)) $(something $(else))", "\\$\\(", "\\)");

Code:

     function matchRecursiveRegExp (str, left, right, flags) {
            var f = flags || "",
      g = f.indexOf("g") > -1,
      x = new RegExp(left + "|" + right, "g" + f),
      l = new RegExp(left, f.replace(/g/g, "")),
      a = [],
      t, s, m;

     do {
      t = 0;
      while (m = x.exec(str)) {
       if (l.test(m[0])) {
        if (!t++) s = x.lastIndex;
       } else if (t) {
        if (!--t) {
         a.push(str.slice(s, m.index));
         if (!g) return a;
        }
       }
      }
     } while (t && (x.lastIndex = s));

     return a;
    }
    document.write("$(" + matchRecursiveRegExp("$(foo $(bar) fizz $(buzz)) $(something $(else))", "\\$\\(", "\\)") + ")");

Note this solution does not support global matching.

Community
  • 1
  • 1
Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563
  • 1
    Took me some time to figure out the logic here with those variable names (eventually found the more wordy original version through your link), but now that I get it that's actually rather elegant. I'd have to a bit of refactoring to use a function here, but it's certainly doable. You'll get the answer if nothing slicker comes around (probably won't). – Stephen Cohen Aug 13 '15 at 20:49