I have a question about regular expressions in Python. I printed the results of re.split('(\d)', 'SPL5IT THE WORDS') and re.split('\d', 'SPL5IT THE WORDS'), and got this:

re.split('\d', 'SPL5IT THE WORDS')
Out[20]: ['SPL', 'IT THE WORDS']

re.split('(\d)', 'SPL5IT THE WORDS')
Out[21]: ['SPL', '5', 'IT THE WORDS']

Why does the second call return the separator, while the first one does not?

TylerH
BOWEN XIN
  • I removed the duplicate status because, while [the other question](https://stackoverflow.com/q/2136556/364696) wants to know *how* to do something, this one already knows, and asks *why* it behaves this way. I'm not sure *why* is a particularly useful question to ask, but it's not a duplicate. – ShadowRanger Oct 17 '18 at 11:00
  • @ShadowRanger At the very least, answering the question signals that you find it useful. – TylerH Oct 17 '18 at 15:28
  • @TylerH: Eh. Or it just means I was bored and didn't mind playing link monkey to the docs. – ShadowRanger Oct 17 '18 at 17:44
  • @TylerH: I decided it was at least a useful thing to answer why the feature exists (not just why it behaves this way), so I've updated [my answer](https://stackoverflow.com/a/52845961/364696). – ShadowRanger Oct 17 '18 at 17:54

1 Answer

Because that's what re.split is documented to do, in the first two lines of its documentation (emphasis added):

Split string by the occurrences of pattern. *If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.*

The "why" for the feature itself is that sometimes you want to know what you captured, particularly when using a more complex pattern that could match all sorts of things, and you might need to adjust your code depending on what the split sequence was.
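As a rough illustration of adjusting code based on the captured separator, here is a minimal sketch that evaluates a chain of additions and subtractions. The `evaluate` helper and the input string are made up for this example; only `re.split` itself comes from the standard library.

```python
import re

def evaluate(expr):
    """Hypothetical helper: evaluate a chain like '10+20-5' left to right."""
    # The capturing group keeps each '+' or '-' in the result list,
    # so parts alternates operand, operator, operand, ...
    parts = re.split(r'([+-])', expr)
    total = int(parts[0])
    for op, operand in zip(parts[1::2], parts[2::2]):
        # Branch on which separator was actually matched.
        if op == '+':
            total += int(operand)
        else:
            total -= int(operand)
    return total

parts = re.split(r'([+-])', '10+20-5')
result = evaluate('10+20-5')
```

Without the capturing group, the split would yield only the numbers, and there would be no way to tell which operator sat between them.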

For the simplest example, suppose you want to mutate certain words in a sentence (in a way complicated enough that re.sub isn't a reasonable option), then reconstruct the sentence exactly as it was, but with the new words. Splitting on non-alphabetic characters, or on runs of whitespace, without capturing would make it impossible to reconstruct the form of the original sentence. Even without mutating any words, if you use plain str.split on runs of whitespace and just assume the separators were single spaces, ' '.join('a\tb\nc d\re'.split()) gets back 'a b c d e'; the moment you split without capturing, you lose data. By contrast, ''.join(re.split(r'(\s+)', 'a\tb\nc d\re')) is lossless.
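The round trip described here can be sketched as follows. The `shout_long_words` helper and its `min_len` parameter are hypothetical, standing in for whatever word mutation is too involved for re.sub; the point is only that the captured whitespace lets the sentence be rebuilt with its original spacing.

```python
import re

def shout_long_words(sentence, min_len=4):
    """Hypothetical mutation: uppercase words of at least min_len characters."""
    # Splitting with a capturing group keeps the whitespace runs:
    # even indices hold words, odd indices hold the separators.
    pieces = re.split(r'(\s+)', sentence)
    pieces[::2] = [w.upper() if len(w) >= min_len else w for w in pieces[::2]]
    # Joining with the empty string restores the original separators.
    return ''.join(pieces)

original = 'tiny\tsentence\nwith  odd spacing'
rebuilt = shout_long_words(original)
unchanged = ''.join(re.split(r'(\s+)', original))
```

Here `rebuilt` keeps the tab, the newline, and the double space exactly where they were, and `unchanged` is identical to `original`.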

If you need to group without capturing, use a non-capturing group of the form (?:PAT) instead of a capturing group, (PAT).
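A small comparison of the two group forms, using a made-up input string, might look like this:

```python
import re

text = 'a, b;c'

# Capturing group: the matched ',' or ';' is returned in the result.
capturing = re.split(r'(,|;)\s*', text)

# Non-capturing group: same alternation, but the separators are dropped.
non_capturing = re.split(r'(?:,|;)\s*', text)
```

With these inputs, `capturing` is `['a', ',', 'b', ';', 'c']` and `non_capturing` is `['a', 'b', 'c']`. Note that only the text matched by the group is returned, not the trailing whitespace matched by `\s*` outside it.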

ShadowRanger