3

Does Python use regex splitting when no separator is given?

I can't look at str.__file__, and the other solutions don't work here either, since split is a method of the built-in str type.

E.g. 'a\t\t\tb' --> ['a', 'b']

Background: I'm considering replacing every run of adjacent whitespace with a single space across many files where performance is critical, and I'm wondering whether a regex split is going to be fast enough; perhaps the built-in shows a better way.

Georgy
PascalVKooten
  • Actually, I finally found http://svn.python.org/view/python/trunk/Objects/stringlib/split.h?revision=77461&view=markup. It is a nice C function that does the split. – PascalVKooten Apr 15 '15 at 22:39

2 Answers

1

First of all, str is built into Python, which means that to look at the source for str.split, you'll have to delve into the C source code where it is defined.

On to your actual question: I suspect that re.sub is going to be not only overkill, but also slower than using the built-in str.split (full disclosure: I don't have timing data to back this up; it's just a feeling I have).

str.split splits on whitespace by default (it takes an optional argument, which specifies a separator string to split on instead). With no argument, it treats any run of consecutive whitespace characters as a single separator and ignores leading and trailing whitespace. This means that calling str.split on a string returns a list of non-empty substrings, none of which contains any whitespace. A mixed run of whitespace (say, a space followed by a tab) is treated no differently from a run of identical characters.

Here are a couple of examples:

In [31]: s = 'hello world'  # one space

In [32]: s.split()
Out[32]: ['hello', 'world']

In [33]: s = 'hello \tworld'  # multiple consecutive whitespace characters

In [34]: s.split()
Out[34]: ['hello', 'world']

In [35]: s = 'hello\tworld'  # a different whitespace character

In [36]: s.split()
Out[36]: ['hello', 'world']

In [37]: s = 'hello\t\tworld'  # multiple consecutive tab characters

In [38]: s.split()
Out[38]: ['hello', 'world']

In [39]: s = 'hello  world'  # multiple consecutive space characters

In [40]: s.split()
Out[40]: ['hello', 'world']

As you can see, it doesn't really matter how the whitespace is arranged: think of str.split as splitting wherever "at least one whitespace character" appears.
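One place where the no-argument split visibly differs from a regex split is at the ends of the string. The following is a small sketch of that difference; the sample string is made up for illustration, and re.split on `\s+` is only one plausible regex equivalent, not something str.split uses internally.

```python
import re

s = '  hello \t\n world  '  # made-up sample with leading and trailing whitespace

# No-argument str.split drops leading and trailing whitespace entirely.
plain = s.split()

# re.split on runs of whitespace keeps an empty string at each end.
regex = re.split(r'\s+', s)

print(plain)  # ['hello', 'world']
print(regex)  # ['', 'hello', 'world', '']
```

So a regex-based replacement for s.split() would also need to strip the string first or filter out the empty fields.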

If you want to replace all consecutive whitespace characters with a single space, you can do it with a str.split and a str.join:

In [41]: ' '.join(['hello', 'world'])  # join the strings 'hello' and 'world' with a space between them
Out[41]: 'hello world'

In [42]: s = 'hello  world'  # notice two spaces between 'hello' and 'world'

In [43]: ' '.join(s.split())
Out[43]: 'hello world'  # notice only one space between 'hello' and 'world'
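Since the speed comparison with a regex is only a hunch here, one way to check it on your own data is a quick timeit run. This is a rough sketch, not a benchmark result: the function names and the sample text are hypothetical, and the regex version adds a strip() so that both functions produce the same output (split plus join also trims the ends).

```python
import re
import timeit

_WS = re.compile(r'\s+')  # precompiled so compile time is not measured

def normalize_split(text):
    """Collapse whitespace runs using split + join (also trims the ends)."""
    return ' '.join(text.split())

def normalize_regex(text):
    """Collapse whitespace runs with a regex, then trim to match split + join."""
    return _WS.sub(' ', text).strip()

sample = 'hello \t\t world\n\n  foo   bar ' * 1000  # made-up test data

# Both approaches should agree before their speed is compared.
assert normalize_split(sample) == normalize_regex(sample)

for fn in (normalize_split, normalize_regex):
    seconds = timeit.timeit(lambda: fn(sample), number=100)
    print(f'{fn.__name__}: {seconds:.4f}s for 100 runs')
```

The numbers will depend on the Python version and on how much whitespace the real files contain, so it is worth running this against a representative file rather than the toy sample.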
inspectorG4dget
  • This is a good introduction I would say. Splitting and joining is suboptimal to just doing inplace replacement with C, isn't it? Is there any C function we could use (and even how?) – PascalVKooten Apr 16 '15 at 07:02
  • Also without data to back it up I also suspect that split would be better than the regex. – PascalVKooten Apr 16 '15 at 07:03
  • @PascalvKooten: I wouldn't venture to guess how to do it with C, though I can give you algorithmic pseudocode with optimal runtime (assuming that strings are implemented as null terminated arrays). Let me know if that's what you're after, and I'll post up such a solution as well – inspectorG4dget Apr 16 '15 at 21:51
0

It doesn't use regex; it uses <wctype.h>'s iswspace(...).

We can see here that it uses the macro STRINGLIB_ISSPACE(...): https://github.com/certik/python-3.3/blob/master/Objects/stringlib/split.h

That macro is defined here as wctype.h's iswspace: http://svn.python.org/projects/python/trunk/Include/unicodeobject.h
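To make the idea concrete, here is a rough pure-Python sketch of a scan-based whitespace split with no regex involved. It is not a translation of the C code in split.h: the function name split_whitespace is made up, and str.isspace is used as a stand-in for the STRINGLIB_ISSPACE check.

```python
def split_whitespace(text):
    """Pure-Python sketch of a whitespace split; not CPython's actual code."""
    parts = []
    i, n = 0, len(text)
    while i < n:
        # Skip a run of whitespace (str.isspace stands in for STRINGLIB_ISSPACE).
        while i < n and text[i].isspace():
            i += 1
        if i == n:
            break
        # Consume a run of non-whitespace and record it as one field.
        start = i
        while i < n and not text[i].isspace():
            i += 1
        parts.append(text[start:i])
    return parts

print(split_whitespace('a\t\t\tb'))  # ['a', 'b']
```

Each character is examined once, so the work is a single linear pass with a per-character whitespace test, which is the same general shape as a hand-written C loop.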

Jules G.M.