import re

txt = "The rain in Spain '69'"
x = re.split(r'\W*', txt)
print(x)

['', 'T', 'h', 'e', '', 'r', 'a', 'i', 'n', '', 'i', 'n', '', 'S', 'p', 'a', 'i', 'n', '', '6', '9', '', '']

txt = "The rain in Spain '69'"
x = re.split(r'\W+', txt)
print(x)

['The', 'rain', 'in', 'Spain', '69', '']

The documentation (python.org):

Another repeating metacharacter is +, which matches one or more times. Pay careful attention to the difference between * and +; * matches zero or more times, so whatever’s being repeated may not be present at all, while + requires at least one occurrence. To use a similar example, ca+t will match 'cat' (1 'a'), 'caaat' (3 'a's), but won’t match 'ct'.

Please explain this difference.

khelwood
Please read [this](https://www.regular-expressions.info/refrepeat.html). – May 29 '20 at 06:33

1 Answer


The split function scans the string for matches of the pattern. Each match ends the current element of the result list and starts a new one.

asterisk split

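At the very start of the string, before any character has been consumed, the pattern already matches: zero non-word characters counts as a match for \W* (0 or more). The string is split there, so the list is now [''].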

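Next, the same thing happens between 'T' and 'h': zero non-word characters is again a match, so the list is now ['', 'T']. This repeats at every position, so each word character gets its own element. Where there really are non-word characters (the spaces and quotes), \W* consumes them, and the empty match right after them produces the extra '' entries.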
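To see this for yourself, you can list where \W* matches. This is only a sketch, and it assumes Python 3.7 or later, where re.split also splits on empty matches. The names matches and parts are just for illustration.

```python
import re

txt = "The rain in Spain '69'"

# Each entry is (start, end, matched_text); an empty match has start == end.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r'\W*', txt)]
print(matches[:4])

# re.split cuts the string at every one of those matches,
# including the empty ones (Python 3.7+ behaviour).
parts = re.split(r'\W*', txt)
print(parts)
```

The first three matches are empty (before 'T', 'h' and 'e'), and the fourth is the space after "The".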

plus regex split

With +, the empty position before the first character does not match, because \W+ needs at least one non-word character. The first match is the space at the end of the first word. For example, splitting "a b" on one or more non-word characters ("\W+") gives ['a', 'b'].

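So it finds the run of non-word characters at the end of every word and splits there. The closing quote after 69 is also a match, which is why the result ends with an empty string.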

If nothing in the string fits the pattern, no split happens: with the pattern a+ (one or more 'a') and the input 'ct', the result is just ['ct'].
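A small check of the three cases above, as a sketch. The variable names are made up for the example.

```python
import re

txt = "The rain in Spain '69'"

# The separators \W+ actually finds: runs of one or more non-word characters.
plus_matches = [m.group() for m in re.finditer(r'\W+', txt)]
print(plus_matches)

# Splitting on those separators.
plus_parts = re.split(r'\W+', txt)
print(plus_parts)

# One separator between two words.
ab_parts = re.split(r'\W+', 'a b')
print(ab_parts)

# No 'a' in the input, so nothing to split on.
ct_parts = re.split(r'a+', 'ct')
print(ct_parts)
```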

definition of 'word' in regex

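Word and non-word: a word character (\w) is [a-zA-Z0-9_] for ASCII text, so what you would expect inside a word. In Python 3, str patterns also count Unicode letters and digits as word characters unless re.ASCII is set. A non-word character (\W) is everything else, typically the spaces and punctuation in between: [^a-zA-Z0-9_] in the ASCII case.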
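A quick way to compare \w with the explicit character class, as a rough sketch. The sample string is made up, and the two non-ASCII letters are written as escapes so the code does not depend on file encoding.

```python
import re

# "a_1 é-ß!" with the non-ASCII letters written as escapes.
sample = "a_1 \u00e9-\u00df!"

# Default str pattern: \w is Unicode-aware.
word_default = re.findall(r'\w', sample)

# With re.ASCII, \w behaves like [a-zA-Z0-9_].
word_ascii = re.findall(r'\w', sample, flags=re.ASCII)
class_ascii = re.findall(r'[a-zA-Z0-9_]', sample)

# \W is the complement of \w under the same flag.
nonword_ascii = re.findall(r'\W', sample, flags=re.ASCII)

print(word_default, word_ascii, class_ascii, nonword_ascii)
```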

Dave Ankin