Use of capture groups in String.split()

Question

$ node
> "ababaabab".split(/a{2}/)
[ 'abab', 'bab' ]
> "ababaabab".split(/(a){2}/)
[ 'abab', 'a', 'bab' ]
>

So, this doesn't make sense to me. Can someone explain it? I don't get why the 'a' shows up.

Note: I am trying to match doubled line endings (possibly in Windows files), so I am splitting on /(\r?\n){2}/. However, I get extraneous '\015\n' entries in my array (note that \015 == \r).

Why are these showing up?

Note: this also affects the JS engines in browsers, so it is specific to JS rather than to Node.

Interesting - same behavior in Ruby too. – Matt Jan 29 '14 at 00:08

score 28 · Accepted Answer · answered Jan 29 '14 at 00:08

In your second result, a is appearing because you've wrapped it in a capture group () (parentheses).

If you don't want it in the result but still need a group (for example, to apply a quantifier), use a non-capturing group: (?:a). Adding the question mark and colon right after the opening parenthesis turns any capture group into a non-capturing one, so it is omitted from the resulting list of captures.

Here's a simple example of this in action: http://regex101.com/r/yM1vM4
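For the doubled-line-ending case from the question, the difference can be sketched as follows. The sample string `text` is a made-up illustration, not taken from the asker's files.

```javascript
// Hypothetical sample: three paragraphs separated by doubled line endings,
// one Windows-style (\r\n\r\n) and one Unix-style (\n\n).
const text = "first\r\n\r\nsecond\n\nthird";

// Capturing group: the captured line ending is spliced into the result.
const withCapture = text.split(/(\r?\n){2}/);
console.log(withCapture); // [ 'first', '\r\n', 'second', '\n', 'third' ]

// Non-capturing group: only the text between the separators is returned.
const withoutCapture = text.split(/(?:\r?\n){2}/);
console.log(withoutCapture); // [ 'first', 'second', 'third' ]
```

The quantifier {2} still applies to the whole group in both patterns; only the capturing behavior differs.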

brandonscript


Nail was hit on the head by @remus, thanks – Steven Lu Jan 29 '14 at 00:11
`====[] |--` I do that :) – brandonscript Jan 29 '14 at 00:12
Is the thing on the left the nail? I've never seen a square nail – Steven Lu Jan 29 '14 at 03:33
No, that was my best-effort hammer ;) On the right is the nail. – brandonscript Jan 29 '14 at 04:33
Ah that makes more sense – Steven Lu Jan 29 '14 at 07:32

score 8 · Answer 2 · edited Jun 20 '20 at 09:12

According to ECMA:

String.prototype.split (separator, limit)

If separator is a regular expression that contains capturing parentheses, then each time separator is matched the results (including any undefined results) of the capturing parentheses are spliced into the output array.

The example given was:

"ababaabab".split(/(a){2}/) // [ "abab", "a", "bab" ]

The split occurs on aa, but only "a" is in the capturing group (a) so that is what is spliced into the output array.

More examples:

"ababaaxaabab".split(/(a){2}/) // ["abab", "a", "x", "a", "bab"]

"ababaaxaabab".split(/(aa)/) // ["abab", "aa", "x", "aa", "bab"]
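The "including any undefined results" part of the quoted text can be seen with a pattern that has two capturing groups of which only one takes part in a match. This is a small sketch with a made-up input, not an example from the specification.

```javascript
// Two capturing groups joined by alternation: for the match on "b",
// group 1 captures "b" and group 2 does not participate (undefined).
const parts = "abc".split(/(b)|(x)/);
console.log(parts); // [ 'a', 'b', undefined, 'c' ]

// If the undefined entries are unwanted, they can be filtered out afterwards.
const cleaned = parts.filter((part) => part !== undefined);
console.log(cleaned); // [ 'a', 'b', 'c' ]
```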

In my opinion, this is a more helpful answer. It explains *why* the behavior is happening and where to read the official documentation. – Daniel Kaplan Apr 24 '22 at 06:38

score 3 · Answer 3 · answered Jan 29 '14 at 00:09

Because the {2} is outside the capturing brackets, it splits on two characters but captures only one of them.

If you move the {2} inside the brackets:

"ababaabab".split(/(a{2})/)

then you'll get

["abab", "aa", "bab"]

If you don't want the 'aa's in the result, don't wrap the pattern in brackets, i.e.

"ababaabab".split(/a{2}/)

Gives

["abab", "bab"]
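Which character a quantified group ends up holding is easier to see when the two matched characters differ. In JavaScript a repeated capturing group keeps only its last iteration, as this short sketch shows (the variable names are illustrative).

```javascript
// The separator matched is "az"; the group (a|z) ran twice and kept
// only its last iteration, "z".
const lastIteration = "ababazbab".split(/(a|z){2}/);
console.log(lastIteration); // [ 'abab', 'z', 'bab' ]

// Moving the quantifier inside a capturing group (with a non-capturing
// inner group for the alternation) keeps the whole separator instead.
const wholeSeparator = "ababazbab".split(/((?:a|z){2})/);
console.log(wholeSeparator); // [ 'abab', 'az', 'bab' ]
```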

Nick Grealy


`()` is capturing the last match, not first... as in `"ababazbab".split(/(a|z){2}/)` => `[..., "z", ...]` – Aprillion Jul 18 '18 at 08:41

score 2 · Answer 4 · answered Jan 29 '14 at 00:14

In regular expressions, () denotes a capturing group. To avoid capturing, use a non-capturing group (?:).

Still makes no sense that I would want to keep the group when I am splitting. But that's fine. `(?:)` it is. – Steven Lu Jan 29 '14 at 00:17

score 1 · Answer 5 · answered Jan 29 '14 at 00:08

split keeps the contents of capturing groups in its result. That's why you see the extra 'a'.

See the description of capturing parentheses in the documentation:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split
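One case where keeping the captured text is handy is splitting while retaining the delimiters. This is a rough sketch with a made-up input string, not an example copied from the linked page.

```javascript
// Wrapping the delimiter in a capturing group keeps each delimiter
// in the output, between the pieces it separated.
const tokens = "1+2-3".split(/([+-])/);
console.log(tokens); // [ '1', '+', '2', '-', '3' ]

// Without the capturing group the delimiters are discarded.
const operands = "1+2-3".split(/[+-]/);
console.log(operands); // [ '1', '2', '3' ]
```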

gtournie


Use non-capturing group? `(?:)` – elclanrs Jan 29 '14 at 00:09
Yeah, that's a `d'oh` moment. – Steven Lu Jan 29 '14 at 00:11
