First of all, str is built into Python, which means that to look at the source for str.split, you're going to have to delve into the C source code where it is defined.
Now, onto your actual question. I have a feeling that re.sub is going to be not only overkill, but also slower than using the built-in str.split (full disclosure: I don't have timing data to back this up - it's just a feeling I have).
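If you want to check that feeling yourself, something along these lines should do it. This is only a rough sketch: the sample string, the repeat count, and the two helper names (with_split_join, with_re_sub) are made up for illustration, and the numbers will vary with your input and Python version.

```python
import re
import timeit

# Hypothetical sample input: words separated by mixed runs of whitespace.
sample = "hello \t\n  world  " * 1000

def with_split_join(text):
    # Split on runs of whitespace, then rejoin with single spaces.
    return " ".join(text.split())

_whitespace_run = re.compile(r"\s+")

def with_re_sub(text):
    # Replace each whitespace run with one space, then trim the ends
    # so the result matches the split/join version.
    return _whitespace_run.sub(" ", text).strip()

# Time 200 calls of each approach on the same input.
split_time = timeit.timeit(lambda: with_split_join(sample), number=200)
regex_time = timeit.timeit(lambda: with_re_sub(sample), number=200)

print(f"split/join: {split_time:.4f}s")
print(f"re.sub:     {regex_time:.4f}s")
```

No expected timings are shown here on purpose; run it on data that looks like yours before drawing conclusions.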
Now, str.split splits on whitespace by default (it takes an optional sep argument, which specifies the delimiter string on which to split; the delimiter may be longer than one character). When called without that argument, it also treats any run of consecutive whitespace characters as a single separator and ignores leading and trailing whitespace. What this means is that if you have a string that contains whitespace characters, calling str.split() on that string will return a list of non-empty substrings, none of which contain any whitespace whatsoever. Thus, if your string has heterogeneous consecutive whitespace characters, those whitespace characters are treated no differently from each other.
Here are a couple of examples:
In [31]: s = 'hello world' # one space
In [32]: s.split()
Out[32]: ['hello', 'world']
In [33]: s = 'hello \tworld' # multiple consecutive whitespace characters
In [34]: s.split()
Out[34]: ['hello', 'world']
In [35]: s = 'hello\tworld' # a different whitespace character
In [36]: s.split()
Out[36]: ['hello', 'world']
In [37]: s = 'hello\t\tworld' # multiple consecutive tab characters
In [38]: s.split()
Out[38]: ['hello', 'world']
In [39]: s = 'hello  world' # multiple consecutive space characters
In [40]: s.split()
Out[40]: ['hello', 'world']
As you can see, it doesn't really matter which whitespace characters appear or how many of them there are - think of str.split() as splitting wherever "at least one whitespace character" presents itself.
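One caveat that is easy to trip over: this collapsing only happens when you call str.split with no argument. If you pass an explicit separator, every occurrence of it counts, so consecutive separators produce empty strings. A small sketch (the sample strings are just for illustration):

```python
s = "hello  \tworld"  # two spaces, then a tab

# No argument: any run of whitespace is one separator.
print(s.split())      # ['hello', 'world']

# Explicit single space: each space splits, and the tab is left alone.
print(s.split(" "))   # ['hello', '', '\tworld']

# The separator can be longer than one character.
print("a, b, c".split(", "))  # ['a', 'b', 'c']
```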
Now, if you want to replace all consecutive whitespace characters with a single space, you could do it with a str.split and a str.join:
In [41]: ' '.join(['hello', 'world']) # join the strings 'hello' and 'world' with a space between them
Out[41]: 'hello world'
In [42]: s = 'hello  world' # notice two spaces between 'hello' and 'world'
In [43]: ' '.join(s.split())
Out[43]: 'hello world' # notice only one space between 'hello' and 'world'
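If you end up doing this in more than one place, you could wrap it in a small helper. The name collapse_whitespace below is made up for this example. Note one difference from a plain re.sub(r"\s+", " ", ...): the split/join version also drops leading and trailing whitespace, whereas the regex version leaves a single space at each end.

```python
import re

def collapse_whitespace(text):
    """Collapse whitespace runs to a single space and trim both ends."""
    return " ".join(text.split())

messy = "  hello \t\n world  "

print(repr(collapse_whitespace(messy)))   # 'hello world'
print(repr(re.sub(r"\s+", " ", messy)))   # ' hello world '
```

Which behaviour you want depends on whether the surrounding whitespace matters to you; add a .strip() to the regex version if you need the two to agree.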