4

I'm currently studying regular expressions and have a question: what is the difference between re.split(" ", string) and re.split("\s+", string)? Since \s represents whitespace, I thought the two would return the same values, as shown here:

>>> import re
>>> a = re.split(" ", "Why is this wrong")
>>> a
['Why', 'is', 'this', 'wrong']
>>> a = re.split(r"\s+", "Why is this wrong")
>>> a
['Why', 'is', 'this', 'wrong']

These two calls return the same result, so I assumed they were equivalent. However, it turns out that they are different. In what cases do they differ? What am I missing here?

Sihwan Lee
  • 3
    `"\s+"` represents one or more of **any** whitespace, including `" ", "\t", "\n"` and a couple more. `" "` is just a single space character. – user2390182 Dec 24 '20 at 13:21
  • 2
    @schwobaseggl so `\s` can also represent more than just `" "` and can express `Enter (which is equal to \n)`, or `" ", with two space characters`? – Sihwan Lee Dec 24 '20 at 13:23
  • Does this answer your question? [Reference - What does this regex mean?](https://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean) – Tim Biegeleisen Dec 24 '20 at 13:24

3 Answers

12

These only look similar because of your example.

A split on ' ' (a single space) does exactly that - it splits on every single space. Consecutive spaces lead to empty strings in the result.

A split on '\s+' treats a run of consecutive whitespace characters as one separator, and it matches other whitespace characters than "pure spaces" as well:

import re

data = "Why    is this  \t \t  wrong"

a = re.split(" ", data)      # split on every single space
b = re.split(r"\s+", data)   # split on each run of whitespace (raw string keeps \s intact)

print(a)
print(b)

Output:

# re.split(" ",data)
['Why', '', '', '', 'is', 'this', '', '\t', '\t', '', 'wrong']

# re.split("\s+",data)
['Why', 'is', 'this', 'wrong']

Documentation:

\s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v]. (https://docs.python.org/3/howto/regex.html#matching-characters)
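As a quick check of that character class, here is a minimal sketch (the sample string is made up for illustration) in which the words are separated by a tab, a newline, a carriage return plus newline, and two spaces. A literal space pattern only reacts to the spaces, while \s+ reacts to all of them:

```python
import re

# Hypothetical sample: words separated by different whitespace characters.
text = "Why\tis\nthis\r\n  wrong"

# A literal space only matches " ", so the tab and newlines stay inside the pieces.
by_space = re.split(" ", text)

# \s+ matches any run of whitespace characters, whatever mix it contains.
by_whitespace = re.split(r"\s+", text)

print(by_space)       # ['Why\tis\nthis\r\n', '', 'wrong']
print(by_whitespace)  # ['Why', 'is', 'this', 'wrong']
```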

Patrick Artner
4

It is about whitespace characters. '\s' matches any whitespace character (space, \t, \n, \r, \f, \v), and '+' means one or more of them in a row, for example " \n \r  \t \v". In my opinion, if plain string operations are enough for the separation, you should use a standard method such as my_string.split(). Otherwise, use a regex. The regex engine has a cost, and a developer should be aware of it.
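To compare the two approaches, here is a small sketch with a made-up input string. One difference that is easy to miss: str.split() with no arguments drops leading and trailing whitespace, whereas re.split with \s+ returns empty strings at the ends unless the input is stripped first.

```python
import re

# Hypothetical input with leading and trailing whitespace.
text = "  Why   is\tthis wrong \n"

# str.split() with no arguments splits on runs of whitespace
# and ignores whitespace at both ends.
plain = text.split()

# re.split keeps an empty string wherever the separator sits at an edge.
regex = re.split(r"\s+", text)

# Stripping first makes the regex version match str.split().
stripped = re.split(r"\s+", text.strip())

print(plain)     # ['Why', 'is', 'this', 'wrong']
print(regex)     # ['', 'Why', 'is', 'this', 'wrong', '']
print(stripped)  # ['Why', 'is', 'this', 'wrong']
```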

2

For the code you posted, there is no difference in the result; both calls output this:

["Why", "is", "this", "wrong"]

The difference is in HOW the string gets split. Both calls use the split() function from re; what differs is the pattern passed as the first argument.

re.split(" ", "Why is this wrong") splits the string on the pattern " ", which matches exactly one space character.

re.split("\s+", "Why is this wrong") splits the string on the regular expression \s+, which matches one or more whitespace characters in a row.

Take note that " " is not the same as \s+. \s+ is a pattern with a special meaning, while " " is just a literal space character. You can find out more about regex here.

\s -> Returns a match where the string contains a whitespace character; the + means one or more of them in a row

I should also say that if you want to split a string on a pattern rather than on one fixed string, then regex is for you.
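As an illustration of splitting on a pattern rather than a fixed string, here is a short sketch. The record string and the choice of separators are assumptions made up for this example:

```python
import re

# Hypothetical record: fields separated by "," or ";" with uneven spacing.
record = "alpha, beta;gamma ,  delta"

# Split on a comma or semicolon, swallowing any whitespace around it.
fields = re.split(r"\s*[,;]\s*", record)

print(fields)  # ['alpha', 'beta', 'gamma', 'delta']
```

A single fixed separator string cannot express "comma or semicolon, with optional spaces", which is where re.split is useful.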

Ice Bear