Because that's what it's documented to do in the first two lines of the documentation (emphasis added):
"Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list."
The "why" for the feature itself is that sometimes you want to know what you captured, particularly when using a more complex pattern that could match all sorts of things, and you might need to adjust your code depending on what the split sequence was.
As a simple example, suppose you want to mutate certain words in a sentence (in a way too complicated for re.sub to be a reasonable option) and then reconstruct the sentence exactly as it was, but with the new words. Splitting on non-alphabetic characters, or on runs of whitespace, without capturing would make it impossible to reconstruct the form of the original sentence. Even without mutating any words, using plain str.split on runs of whitespace and just assuming the separators were single spaces loses data: ' '.join('a\tb\nc d\re'.split()) gets back 'a b c d e'. The moment you split without capturing, you've lost the separators. By contrast, ''.join(re.split(r'(\s+)', 'a\tb\nc d\re')) is lossless.
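The word-mutation case described above can be sketched roughly as follows. The helper name transform_words is hypothetical, and str.upper is only a stand-in for whatever per-word logic would be too involved for re.sub; the point is that the captured whitespace runs let the original layout be reassembled.

```python
import re

def transform_words(sentence, transform):
    # Hypothetical helper, for illustration only.
    # Because the pattern has a capturing group, re.split keeps each
    # whitespace run in the result: words land at even indices and
    # separators at odd indices.
    parts = re.split(r'(\s+)', sentence)
    # Rewrite only the words; leave the separators untouched.
    parts[::2] = [transform(word) for word in parts[::2]]
    return ''.join(parts)

original = 'a\tb\nc d\re'
result = transform_words(original, str.upper)
```

Here result keeps the tab, newline, space, and carriage return in their original positions, with only the words changed.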
If you need to group without capturing, use non-capturing groups of the form (?:PAT) instead of capturing groups of the form (PAT).
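A small side-by-side sketch of the difference, using a made-up input string and an alternation that needs grouping either way:

```python
import re

text = 'a, b;c'  # illustrative input, not from the question

# Capturing group: the matched delimiter is returned in the list.
capturing = re.split(r'(,|;)\s*', text)

# Non-capturing group: same grouping for the alternation,
# but the delimiters are dropped from the result.
non_capturing = re.split(r'(?:,|;)\s*', text)
```

With the capturing form the list is ['a', ',', 'b', ';', 'c']; with the non-capturing form it is ['a', 'b', 'c'].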