27
$ node
> "ababaabab".split(/a{2}/)
[ 'abab', 'bab' ]
> "ababaabab".split(/(a){2}/)
[ 'abab', 'a', 'bab' ]
>

So, this doesn't make sense to me. Can someone explain it? I don't get why the 'a' shows up.

Note: I am trying to match doubled line endings (possibly in Windows files), so I am splitting on /(\r?\n){2}/. However, I get extraneous '\015\n' entries in my array (note \015 == \r).

Why are these showing up?

Note: this also happens in browser JS engines, so it is specific to JavaScript, not Node.

Steven Lu

5 Answers

28

In your second result, a is appearing because you've wrapped it in a capture group () (parentheses).

If you don't want it included but still need the grouping (for example, to apply a quantifier to it), use a non-capturing group: (?:a). Placing ?: right after the opening parenthesis of any group makes it non-capturing, so it is omitted from the resulting list of captures.

Here's a simple example of this in action: http://regex101.com/r/yM1vM4
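
Applied to the doubled line endings from the question, a minimal sketch might look like this (the sample string `text` is made up for illustration):

```javascript
// Hypothetical sample: three paragraphs, the first gap using Windows line endings.
const text = "first\r\n\r\nsecond\n\nthird";

// Capturing group: the last \r?\n of each match is spliced into the result.
const withCapture = text.split(/(\r?\n){2}/);

// Non-capturing group: same separator, nothing extra in the result.
const withoutCapture = text.split(/(?:\r?\n){2}/);

console.log(withCapture);    // [ 'first', '\r\n', 'second', '\n', 'third' ]
console.log(withoutCapture); // [ 'first', 'second', 'third' ]
```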

brandonscript
8

According to ECMA:

String.prototype.split (separator, limit)

If separator is a regular expression that contains capturing parentheses, then each time separator is matched the results (including any undefined results) of the capturing parentheses are spliced into the output array.

The example given was:

"ababaabab".split(/(a){2}/) // [ "abab", "a", "bab" ]

The split occurs on aa, but only "a" is in the capturing group (a) so that is what is spliced into the output array.
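
The "including any undefined results" part of the quote can be seen with an alternation where only one group participates in each match. A small sketch (the input string is illustrative only):

```javascript
// Two capturing groups, but only one of them participates in any given match.
const parts = "abc".split(/(b)|(x)/);

// Both groups are spliced in after the match; the one that did not
// participate contributes undefined.
console.log(parts); // [ 'a', 'b', undefined, 'c' ]
```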

More examples:

"ababaaxaabab".split(/(a){2}/) // ["abab", "a", "x", "a", "bab"]

"ababaaxaabab".split(/(aa)/) // ["abab", "aa", "x", "aa", "bab"]
Matt
  • In my opinion, this is a more helpful answer. It explains *why* the behavior is happening and where to read the official documentation. – Daniel Kaplan Apr 24 '22 at 06:38
3

Because the {2} is outside the capturing brackets, I'm guessing it splits on 2 characters, but only captures the first.

If you move the {2} inside the brackets:

"ababaabab".split(/(a{2})/)

then you'll get

["abab", "aa", "bab"]

If you don't want the 'aa's in the result, leave out the brackets, i.e.

"ababaabab".split(/a{2}/)

Gives

["abab", "bab"]
Nick Grealy
  • `()` is capturing the last match, not first... as in `"ababazbab".split(/(a|z){2}/)` => `[..., "z", ...]` – Aprillion Jul 18 '18 at 08:41
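
As the comment points out, a quantified group keeps the value from its last iteration. A quick sketch to check this, using the comment's own input:

```javascript
// The separator matches "az"; the group (a|z) runs twice and keeps the last value.
const result = "ababazbab".split(/(a|z){2}/);
console.log(result); // [ 'abab', 'z', 'bab' ]

// The same behavior is visible with exec: index 1 holds the last iteration.
const match = /(a|z){2}/.exec("ababazbab");
console.log(match[0], match[1]); // az z
```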
2

In regular expressions, () denotes a capturing group. To avoid capturing, use a non-capturing group (?:).

  • 2
    Still makes no sense that I would want to keep the group when I am splitting. But that's fine. `(?:)` it is. – Steven Lu Jan 29 '14 at 00:17
1

split keeps capturing groups. That's why you see it in the result.

See the description of capturing parentheses on MDN:

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/split
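
One case where keeping the captured text is handy is tokenizing, where the separators themselves are needed. A minimal sketch, with an arithmetic string chosen purely for illustration:

```javascript
// Split on operators but keep them in the output by capturing them.
const tokens = "1+2-3".split(/([+-])/);
console.log(tokens); // [ '1', '+', '2', '-', '3' ]
```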

gtournie