4

I'm learning Python's regular expressions. The following works as I expected:

>>> import re
>>> re.split(r'\s+|:', 'find   a str:s2')
['find', 'a', 'str', 's2']

But when I change + to *, the output looks weird to me:

>>> re.split(r'\s*|:', 'find  a str:s2')
['find', 'a', 'str:s2']

How is such a pattern interpreted in Python?

Martijn Pieters
Deqing

2 Answers

9

The 'side effect' you are seeing is that re.split() only splits on matches that are longer than 0 characters. (This holds for Python versions before 3.7; from 3.7 on, re.split() also splits on empty matches.)

The \s*|: pattern matches either zero or more whitespace characters, or :, and the alternatives are tried in that order. But zero whitespace characters matches everywhere. In those locations where one or more whitespace characters matched, the split is made.

Because \s* succeeds at every position (with an empty match if there is no whitespace there), the second alternative : is never tried.
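
To see this directly, here is a small sketch that checks what each ordering of the alternatives matches at the position of the colon. The helper name `span_at` and the sample string are made up for illustration; `Pattern.match(string, pos)` is the standard call that anchors a match attempt at a given index.

```python
import re

TEXT = 'str:s2'
COLON = TEXT.index(':')  # index 3

def span_at(pattern, text, pos):
    """Return the (start, end) span matched when the attempt starts at pos."""
    return re.compile(pattern).match(text, pos).span()

# \s* is tried first and succeeds with zero characters,
# so the ':' alternative is never reached.
print(span_at(r'\s*|:', TEXT, COLON))   # (3, 3)

# With ':' listed first, the colon itself is consumed.
print(span_at(r':|\s*', TEXT, COLON))   # (3, 4)
```

The first span is empty (start equals end), which is exactly the kind of match the split described above ignores.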

Splitting on non-empty matches is called out explicitly in the re.split() documentation:

Note that split will never split a string on an empty pattern match.

If you reverse the pattern, : is considered, as it is the first choice:

>>> re.split(r':|\s*', 'find  a str:s2')
['find', 'a', 'str', 's2']
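
If the goal is simply to split on runs of whitespace or on a colon, one option is to use a pattern in which no alternative can match zero characters, so the handling of empty matches never comes into play. This is only a minimal sketch with an illustrative variable name, reusing the pattern from the question:

```python
import re

text = 'find  a str:s2'

# Both alternatives consume at least one character,
# so an empty match is impossible at any position.
parts = re.split(r'\s+|:', text)
print(parts)  # ['find', 'a', 'str', 's2']
```

Because Python 3.7 changed re.split() to split on empty matches as well, a pattern like this should give the same result on older and newer versions.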
Martijn Pieters
  • So for the first character 'f' in the string, can I say it matches the pattern, but no split happens because it is an "empty pattern match"? – Deqing Jun 24 '14 at 15:20
  • @Deqing: For `f`, the `\s*` part matches. It is a 0-width match, so no split takes place. Next, `i` is tested, and it too matches `\s*`, etc. – Martijn Pieters Jun 24 '14 at 15:22
-4

If you meant to do "or" for your matching, then you have to do something like this: re.split(r'(\s*|:)', 'find a str:s2'). In short: "+" means "at least one character"; "*" means any number (or none).

nochkin