I'm new to regular expressions, and I couldn't figure out how to get this to work by just googling it. I think part of my problem might be that I'm having trouble phrasing the question in search terms.

Here's my problem:

I have a string that looks like this:

OSDfhosjdjakjdnvkjndkfvjelkrjejrijrvrvrjvnkrjvnkn(mint (light) green pants)shdbfhsbdhfbsjd(couch)hvbjshdbvjhsbdfbjs(forest (dark) (stained) green shirt) sjdfjsdhfjshkdfjskdjfksjdfhfskdjf(table)

I want to select the entire contents of the parentheses containing the word "green," and only those parentheses. That is to say, I want to return "mint (light) green pants" and "forest (dark) (stained) green shirt" but not "couch", "table", or any of the gibberish.

What I've tried so far:

  • /(.*?green.*?/) seemed to return an almost arbitrary block of text surrounding "green" and beginning and ending with a /, which makes me think I screwed up escaping the parentheses somehow.

  • /(.*green.*/) seemed to return the entire document.

  • Googling the problem: It seems from the pages I'm finding here and on google that what I want is a lookbehind, a regex functionality that JavaScript doesn't support. Unfortunately, I'm working in JS, so I need a way to make this work.

Edited: I just realized that the text I want to be outputting contains more parentheses than I originally realized, and edited my example to reflect this.

Hexiva

2 Answers

Instead of a lookbehind, you could use a capturing group. First match the opening parenthesis \( and then, inside a capturing group, match everything up to the closing parenthesis \).

Your values will be in capturing group 1.

\(([^)]+\bgreen\b[^)]+)\)

Explanation

  • \( Match the opening parenthesis
  • ([^)]+ Open the capturing group and match one or more characters other than ), using a negated character class
  • \bgreen\b Match the word green, using word boundaries so that it is not part of a larger word
  • [^)]+ Match one or more characters other than )
  • ) Close the capturing group
  • \) Match the closing parenthesis

// The g flag lets exec() walk through every match in the string.
const regex = /\(([^)]+\bgreen\b[^)]+)\)/g;
const str = `OSDfhosjdjakjdnvkjndkfvjelkrjejrijrvrvrjvnkrjvnkn(mint green pants)shdbfhsbdhfbsjd(couch)hvbjshdbvjhsbdfbjs(forest green shirt) sjdfjsdhfjshkdfjskdjfksjdfhfskdjf(table)`;
let m;
while ((m = regex.exec(str)) !== null) {
  // Guard against infinite loops on zero-width matches.
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  // Group 1 holds the text between the parentheses.
  console.log(m[1]);
}
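
For reference, here is a minimal sketch of the same extraction with String.prototype.matchAll, which avoids the manual exec loop. The shortened sample strings and the helper name extractGroups are made up for this example. It also shows that, as far as I can tell, this first pattern finds nothing once the groups contain nested parentheses, because [^)]+ stops at the first ) it meets:

```javascript
const flatPattern = /\(([^)]+\bgreen\b[^)]+)\)/g;

// Collect capturing group 1 of every match (matchAll requires the g flag).
function extractGroups(pattern, str) {
  return Array.from(str.matchAll(pattern), (m) => m[1]);
}

// Shortened stand-ins for the strings in the question (hypothetical data).
const flat = "abc(mint green pants)def(couch)ghi(forest green shirt) jkl(table)";
const nested = "abc(mint (light) green pants)def(couch)ghi(forest (dark) (stained) green shirt) jkl(table)";

console.log(extractGroups(flatPattern, flat));   // [ 'mint green pants', 'forest green shirt' ]
console.log(extractGroups(flatPattern, nested)); // []
```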

Edit

To allow balanced parentheses before green, you could repeat a non-capturing group (?:...) containing an alternation that matches either one or more characters other than a closing parenthesis, or a balanced pair of parentheses, (?:[^\)]+|\([^)]+\))*:

\(((?:[^\)]+|\([^)]+\))*\bgreen\b[^)]+)\)

const regex = /\(((?:[^\)]+|\([^)]+\))*\bgreen\b[^)]+)\)/g;
const str = `OSDfhosjdjakjdnvkjndkfvjelkrjejrijrvrvrjvnkrjvnkn(mint (light) green pants)shdbfhsbdhfbsjd(couch)hvbjshdbvjhsbdfbjs(forest (dark) (stained) green shirt) sjdfjsdhfjshkdfjskdjfksjdfhfskdjf(table)`;
let m;
while ((m = regex.exec(str)) !== null) {
  // Guard against infinite loops on zero-width matches.
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }
  // Group 1 now includes the nested pairs that come before "green".
  console.log(m[1]);
}
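
One thing to watch: the part after green is still [^)]+, so an input such as (green (dark) pants) appears to be cut off at the first closing parenthesis. Below is a minimal sketch of a variant that handles one level of nested pairs on both sides of green. It is my own variation and not part of the pattern above; the names unit, greenGroup and the shortened sample string are made up for the example.

```javascript
// Any run of plain characters or one-level (...) pairs.
const unit = String.raw`(?:[^()]|\([^()]*\))*`;
// Same idea as the pattern above, with nested pairs allowed on both sides of "green".
const greenGroup = new RegExp(String.raw`\((${unit}\bgreen\b${unit})\)`, "g");

// Shortened, hypothetical sample with a nested pair after "green" in the last group.
const sample = "abc(mint (light) green pants)def(couch)ghi(forest (dark) (stained) green shirt) jkl(green (dark) pants)";
const found = Array.from(sample.matchAll(greenGroup), (m) => m[1]);
console.log(found);
// [ 'mint (light) green pants', 'forest (dark) (stained) green shirt', 'green (dark) pants' ]
```

Matching one character at a time inside the repetition, instead of [^)]+, also avoids nesting one quantifier inside another, which should keep backtracking manageable on long groups that do not contain green.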
The fourth bird

Matching balanced parentheses is not an easy problem, and it is even harder in JavaScript because the JS regex engine does not support recursion. Let me cite Steven Levithan on the matter:

The problem, in this case, lies in how you distinguish between the last closing bracket ... and any of the inner brackets. The only difference between the last closing bracket and the inner brackets is that they are logically linked (i.e., they form an open/close pair). This logic is impossible to implement by simple lookaround assertion.

However, he concludes, if there is a known maximum amount of recursion that needs to be accounted for, it's possible.
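
To illustrate the point about a known maximum depth, here is a sketch (mine, not taken from Levithan's article) that builds such a pattern by wrapping the previous level once per extra level of nesting. The function name balancedPattern and the sample string are made up for the example.

```javascript
// Build a pattern source for balanced parentheses nested at most maxDepth levels.
function balancedPattern(maxDepth) {
  let source = String.raw`\([^()]*\)`; // depth 1: a pair with no parentheses inside
  for (let depth = 2; depth <= maxDepth; depth++) {
    // Each extra level allows the previous level's pattern inside the pair.
    source = String.raw`\((?:[^()]|${source})*\)`;
  }
  return source;
}

const upToThree = new RegExp(balancedPattern(3), "g");
const groups = "x(a (b (c)) green)y(couch)".match(upToThree);
console.log(groups); // [ '(a (b (c)) green)', '(couch)' ]
console.log(groups.filter((g) => /\bgreen\b/.test(g))); // [ '(a (b (c)) green)' ]
```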

Here's a solution that does not use any advanced regex features and works just fine with vanilla JavaScript.

\((?:\([^()]*?\)|([^()]*\bgreen\b[^()]*)?|[^()])*?\)

Explanation

  • \( Match opening parenthesis
  • (?:...) Non-capturing group with three alternatives:
    • \([^()]*?\) Match an inner pair of parentheses; [^()]*? lazily matches any characters other than ( or ), using a negated character class
    • ([^()]*\bgreen\b[^()]*)? Optionally capture, in group 1, the word green (with word boundaries) together with the surrounding characters other than ( or ); greedy
    • [^()] A "modified dot": any single character other than ( or ), to keep the parentheses balanced
  • )*? Close the non-capturing group and repeat it zero or more times, lazily
  • \) Match )

I use an extra capture group to check for the search term: if group 1 ($1) is not set, the full match does not meet the requirement and can be discarded:

Sample Code:

const regex = /\((?:\([^()]*?\)|([^()]*\bgreen\b[^()]*)?|[^()])*?\)/g;
const str = `OSDfhosjdjakjdnvkjndkfvjelkrjejrijrvrvrjvnkrjvnkn(mint (light) green pants)shdbfhsbdhfbsjd(couch)hvbjshdbvjhsbdfbjs(forest (dark) (stained) green shirt) sjdfjsdhfjshkd(fjskdjfksjdfhfskdjf(green table) (green)`;
let m;

while ((m = regex.exec(str)) !== null) {
  // This is necessary to avoid infinite loops with zero-width matches
  if (m.index === regex.lastIndex) {
    regex.lastIndex++;
  }

  // Skip matches where group 1 did not capture "green".
  if (m[1]) {
    console.log(`Found ${m[0]}`);
  }
}
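
One JavaScript detail worth checking: capture groups inside a repeated group are reset on every iteration, so when green comes before an inner pair, as in (green (x) pants), group 1 appears to end up undefined and the match would be discarded. Here is a minimal sketch of one way around this, under the same one-level nesting limit: match the balanced group without the inner capture, then test the text outside the inner pairs. The names oneLevel and greenGroups are made up for this example.

```javascript
// One level of nesting, with no capture group inside the repetition.
const oneLevel = /\((?:[^()]|\([^()]*\))*\)/g;

function greenGroups(str) {
  return (str.match(oneLevel) || []).filter((group) => {
    // Blank out inner pairs so only the group's own top-level text is tested.
    const topLevelText = group.slice(1, -1).replace(/\([^()]*\)/g, " ");
    return /\bgreen\b/.test(topLevelText);
  });
}

console.log(greenGroups("a(green (x) pants)b(couch)c(mint (light) green pants)"));
// [ '(green (x) pants)', '(mint (light) green pants)' ]
```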

Caveats: this works only if

  • the parentheses are actually balanced,
  • and the parentheses are nested no more than one level deep. If more levels are needed, extend the pattern as described by Steven Levithan.
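
If the nesting depth is not known in advance, a plain character scan with a depth counter may be simpler than a regex. This is a swapped-in technique rather than the pattern above, sketched under the assumption that only the outermost groups are wanted; the function name topLevelGroups and the sample text are made up for the example.

```javascript
// Return the contents of every top-level (...) group, at any nesting depth.
function topLevelGroups(str) {
  const result = [];
  let depth = 0;
  let start = -1;
  for (let i = 0; i < str.length; i++) {
    if (str[i] === "(") {
      if (depth === 0) start = i + 1; // remember where the outermost group begins
      depth++;
    } else if (str[i] === ")" && depth > 0) {
      depth--;
      if (depth === 0) result.push(str.slice(start, i));
    }
    // A ")" seen at depth 0 is unbalanced and is ignored.
  }
  return result; // an unclosed "(" at the end yields no group
}

const text = "abc(mint (light) green pants)def(couch)ghi(a (b (c)) green shirt) jkl(table)";
console.log(topLevelGroups(text).filter((g) => /\bgreen\b/.test(g)));
// [ 'mint (light) green pants', 'a (b (c)) green shirt' ]
```

Note that the filter here tests the whole group, so a group whose only green sits inside an inner pair is kept as well.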
wp78de