Javascript regex: How to avoid a capturing group becoming "undefined"?

Question

If I want to capture e.g. text that is either in round brackets or square brackets and use this regular expression:

\[(.+)\]|\((.+)\)

For the example "[test]" I get the results "test" and "undefined", and for "(test)" I get "undefined" and "test". How can I get only "test" as the result?

(This regex is only an example, my actual regex is more complex but with the same problem.)

https://stackoverflow.com/q/5367369/2864740 - named capture groups (2018+ only!), or post-manipulation. — user2864740, Sep 26 '20 at 17:18
Without a branch reset feature, all you can do is remove the undefined value from the result. — Wiktor Stribiżew, Sep 26 '20 at 18:21

lrn · Answer 1 · 2023-05-08T08:31:26.607

If you use look-ahead to match either option first, then capture again in a second pass, you can get the match into a single capture.

The simplest approach uses other captures too:

(?=\[(.+?)\]|\((.+?)\))[(\[](\1\2)[)\]]

How it works: the look-ahead matches either [...] or (...), capturing the text between the delimiters into capture 1 or 2. The main pattern then consumes the opening delimiter and captures the same text again by back-referencing \1\2, relying on the fact that a back-reference to a non-participating group matches the empty string. The same string therefore ends up in capture 3, which always participates.

It's probably fairly efficient. The back-reference to the same position should match in no time.
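A minimal sketch of how this pattern might be used. The regex is copied from above; the helper name textsInBrackets and the sample input are assumptions made for illustration only.

```javascript
// Pattern from the answer. Captures 1 and 2 are filled inside the look-ahead;
// capture 3 repeats whichever of them participated.
const bracketed = /(?=\[(.+?)\]|\((.+?)\))[(\[](\1\2)[)\]]/g;

// Hypothetical helper: collect capture 3 from every match in the input.
function textsInBrackets(input) {
  return Array.from(input.matchAll(bracketed), (m) => m[3]);
}

console.log(textsInBrackets(' [foo] (bar) ')); // [ 'foo', 'bar' ]
```

Captures 1 and 2 are still present in each match array (one of them undefined); the caller simply reads index 3 and ignores them.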

If that's not good enough, and you want a RegExp with precisely one capture, which is the text between [..] or (..), then I'd try look-behinds instead:

[(\[](.+?)(?:(?=\))(?<=\(\1)|(?=\])(?<=\[\1))

It matches a [ or (, then tries to find a capture after it that is followed by either ) or ], and then does a backwards check to see whether the leading delimiter was the matching ( or [ respectively.

Unlikely to be as efficient, but it only matches (...) and [...] and captures what's between them in the single capture. If the look-behind back-reference to the same position is efficient (not as guaranteed, but possible), it's potentially not bad. If it's not efficient, it may do a lot of looking back (but only when it sees a possible closing ) or ]).
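A small sketch to try the single-capture version, assuming an engine with look-behind support (ES2018 or later). The variable names and sample strings are illustrative assumptions.

```javascript
// Pattern from the answer: one capture group, delimiter pairing checked by look-behind.
const singleCapture = /[(\[](.+?)(?:(?=\))(?<=\(\1)|(?=\])(?<=\[\1))/;

const hit = singleCapture.exec('see [test] here');
console.log(hit[0]); // "[test" (the closing delimiter is only looked at, not consumed)
console.log(hit[1]); // "test"

// Mismatched delimiters do not match.
const miss = singleCapture.exec('[test)');
console.log(miss); // null
```

Note that the full match (index 0) includes the opening delimiter but not the closing one, since the closing delimiter is only checked by the look-ahead.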

It can also be converted to a RegExp which matches only the text you want, so "capture zero" is the result (as well as capture 1, which it uses internally), by matching the leading [ or ( with a look-behind:

(?<=[(\[])(.+?)(?:(?=\))(?<=\(\1)|(?=\])(?<=\[\1))
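As a rough illustration, assuming look-behind support (ES2018 or later), this variant can be used with String.prototype.match and the g flag, so the full matches are already the wanted texts. The helper name innerTexts is a hypothetical one.

```javascript
// Pattern from the answer: the opening delimiter is matched by a look-behind,
// so the full match ("capture zero") is just the text between the delimiters.
const contentOnly = /(?<=[(\[])(.+?)(?:(?=\))(?<=\(\1)|(?=\])(?<=\[\1))/g;

// Hypothetical helper: with the g flag, match() returns only the full matches.
function innerTexts(input) {
  return input.match(contentOnly) ?? [];
}

console.log(innerTexts(' [foo] (bar) ')); // [ 'foo', 'bar' ]
```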

(Look-behinds and look-aheads really are the gift that keeps giving when it comes to RegExp power. Both allow you to match the same sub-string more than once, using different RegExps, and even allow the later ones to refer to captures from earlier matches.)

score -1 · Answer 2 · answered Sep 26 '20 at 17:19

If the specific group numbers that get captured don't matter, just the text they contain, I think the easiest thing is to just filter the match afterwards to remove the undefined groups:

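// The question's regex: exactly one of the two groups participates per match.
for (const match of ' [foo] (bar) '.matchAll(/\[(.+)\]|\((.+)\)/g)) {
  // Drop the undefined group; index 0 is the full match, index 1 the captured text.
  const [, text] = match.filter(m => m !== undefined);
  console.log(text);
}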
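The same filtering idea could be wrapped in a reusable function. This is only a sketch; the name firstCapture is hypothetical and it assumes at most one group participates per match.

```javascript
// Hypothetical helper: return the first capture group that participated,
// skipping index 0 (the full match).
function firstCapture(match) {
  return match.slice(1).find((group) => group !== undefined);
}

const pattern = /\[(.+)\]|\((.+)\)/g; // the question's regex, with the g flag
const texts = Array.from(' [foo] (bar) '.matchAll(pattern), (m) => firstCapture(m));
console.log(texts); // [ 'foo', 'bar' ]
```

Unlike filtering the whole match array, this never confuses the full match with a capture, and it returns undefined if no group participated.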
