
I am porting some Python 2 code that calls split() on strings, so I need to know its exact behavior. The documentation states that when you do not specify the sep argument, "runs of consecutive whitespace are regarded as a single separator".

Unfortunately, it does not specify which characters that would be. There are some obvious contenders (like space, tab, and newline), but Unicode contains plenty of other candidates.

Which characters are considered to be whitespace by split()?

Since the answer might be implementation-specific, I'm targeting CPython.

(Note: I researched the answer to this myself since I couldn't find it anywhere, so I'll be posting it here, hopefully for the benefit of others.)

Aasmund Eldhuset
  • Related: https://bugs.python.org/issue25433 – Stef Sep 13 '22 at 13:25
  • It looks like CPython's source code for str.split, str.strip and str.isspace all relies on a macro called `Py_UNICODE_ISSPACE`, which is defined here: [cpython/unicodeobject.h#L902](https://github.com/python/cpython/blob/main/Include/cpython/unicodeobject.h#L902) – Stef Sep 13 '22 at 13:32
  • [Similar question about str.strip](https://stackoverflow.com/questions/73661849/which-specific-characters-does-the-strip-function-remove), with an answer that lists all Unicode characters that count as whitespace for str.split, str.strip or str.isspace. – Stef Sep 13 '22 at 13:33

2 Answers


Unfortunately, it depends on whether your string is an str or a unicode (at least, in CPython - I don't know whether this behavior is actually mandated by a specification anywhere).

If it is an str, the answer is straightforward:

  • 0x09 Tab
  • 0x0a Newline
  • 0x0b Vertical Tab
  • 0x0c Form Feed
  • 0x0d Carriage Return
  • 0x20 Space

Source: these are the characters with PY_CTF_SPACE in Python/pyctype.c, which are used by Py_ISSPACE, which is used by STRINGLIB_ISSPACE, which is used by split_whitespace.
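
The six-byte set can also be checked empirically. The following is a minimal sketch that uses Python 3's `bytes` type as a stand-in for Python 2's `str` (an assumption: both are byte strings, but this is not the Python 2 code path itself) and compares the observed set with `string.whitespace`. The name `byte_whitespace` is purely illustrative.

```python
import string

# Assumption: Python 3 bytes.split() stands in for Python 2 str.split().
# Place every possible byte value between b"a" and b"b" and keep the
# values that cause split() to return more than one piece.
byte_whitespace = {
    value for value in range(256)
    if len((b"a" + bytes([value]) + b"b").split()) > 1
}

print([hex(value) for value in sorted(byte_whitespace)])
# ['0x9', '0xa', '0xb', '0xc', '0xd', '0x20']

# The observed set coincides with the ASCII characters in string.whitespace.
print(byte_whitespace == set(string.whitespace.encode("ascii")))
```

The agreement with `string.whitespace` is an observation about this implementation rather than something `split()` is documented to guarantee.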

If it is a unicode, there are 29 characters, which in addition to the above are:

  • U+001c through U+001f: File/Group/Record/Unit Separator
  • U+0085: Next Line
  • U+00a0: Non-Breaking Space
  • U+1680: Ogham Space Mark
  • U+2000 through U+200a: various fixed-size spaces (e.g. Em Space), but note that Zero-Width Space (U+200b) is not included
  • U+2028: Line Separator
  • U+2029: Paragraph Separator
  • U+202f: Narrow No-Break Space
  • U+205f: Medium Mathematical Space
  • U+3000: Ideographic Space

Note that the first four are also valid ASCII characters, which means that an ASCII-only string might split differently depending on whether it is an str or a unicode!
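
As a rough illustration of that difference, here is a small sketch in Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` plays the role of Python 2's `str` (that mapping is an assumption made for the example, not something taken from the Python 2 source):

```python
separator = "\x1c"  # File Separator: valid ASCII, but not one of the six byte-string whitespace characters

# Text string (analogous to Python 2 unicode): U+001c counts as whitespace.
text_parts = ("a" + separator + "b").split()

# Byte string (analogous to Python 2 str): 0x1c is an ordinary byte.
byte_parts = ("a" + separator + "b").encode("ascii").split()

print(text_parts)  # ['a', 'b']
print(byte_parts)  # [b'a\x1cb']
```

The same ASCII-only content yields one piece as a byte string and two pieces as a text string.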

Source: these are the characters listed in _PyUnicode_IsWhitespace, which is used by Py_UNICODE_ISSPACE, which is used by STRINGLIB_ISSPACE. (It looks like the same function implementations are used for both str and unicode, but they are compiled separately for each type, with certain macros implemented differently.) The docstring describes this set of characters as follows:

Unicode characters having the bidirectional type 'WS', 'B' or 'S' or the category 'Zs'
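
That description can be compared against what `split()` actually does. The sketch below is written for Python 3 and relies on the standard `unicodedata` module; `is_documented_whitespace` is a hypothetical helper name introduced only for this example, and the result depends on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata


def is_documented_whitespace(ch):
    """Hypothetical helper: apply the rule quoted from the docstring."""
    return (unicodedata.bidirectional(ch) in ("WS", "B", "S")
            or unicodedata.category(ch) == "Zs")


documented = set()  # code points matching the docstring's description
observed = set()    # code points that split() really treats as separators

for code_point in range(sys.maxunicode + 1):
    ch = chr(code_point)
    if is_documented_whitespace(ch):
        documented.add(code_point)
    if len(("a" + ch + "b").split()) > 1:
        observed.add(code_point)

print(len(documented), len(observed))
print(documented == observed)   # do the two sets agree on this interpreter?
print(0x200b in observed)       # is Zero-Width Space treated as whitespace?
```

If the two sets ever differ on a given interpreter, the set difference shows exactly which code points deviate from the description.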

Aasmund Eldhuset
  • Have you looked at [`string.whitespace`](https://docs.python.org/3/library/string.html#string.whitespace)? – awarrier99 May 02 '20 at 21:47
  • Not sure which characters exactly that entails but I'm sure you could print out their codes to check. Seems to be this string here ' \t\n\r\x0b\x0c' – awarrier99 May 02 '20 at 21:47
  • @awarrier99: Neither `str.split` nor `unicode.split` actually uses `string.whitespace`. – user2357112 May 02 '20 at 21:49
  • @user2357112supportsMonica huh I would've thought otherwise. Guess I should've checked the source first – awarrier99 May 02 '20 at 21:50
  • @awarrier99: Neat - that just gives the ASCII set, though, and since the documentation doesn't reference it, there is no guarantee that that's what's being used by `split()` (and it turns out that it's not) or that it agrees with what `split()` does (which turns out to be the case for `str`, but the docs don't promise that). – Aasmund Eldhuset May 02 '20 at 21:51
  • I wonder, though, if this is undefined behavior. The list shown could just be an implementation detail of CPython, though that probably establishes a *de facto* definition. – chepner May 02 '20 at 21:55
  • @AasmundEldhuset yup I may have jumped to some conclusions there. Interesting though that the docs don't specifically attribute what is being used. This [question](https://stackoverflow.com/questions/37903317/is-there-a-python-constant-for-unicode-whitespace) (if you haven't already come across it) has a few more details about what is used under the hood – awarrier99 May 02 '20 at 22:02
  • @chepner: Good point, though I hope you mean _unspecified_ or _implementation-specified_ ("the implementation may do it in a reasonable way") rather than _undefined_ ("literally anything might happen")? I made my question and answer CPython-specific. – Aasmund Eldhuset May 02 '20 at 22:03
  • @awarrier99: Thanks - I had not previously seen that answer. Even if I had, it wouldn't directly have answered my question about what `split()` does, but maybe it would have helped me find the answer more quickly since it points out `_PyUnicode_IsWhitespace`. – Aasmund Eldhuset May 02 '20 at 22:06
  • @AasmundEldhuset I'm very bad at keeping the two terms straight. CPython is definitely doing something reasonable :) – chepner May 03 '20 at 01:17

The answer by Aasmund Eldhuset is what I was attempting to do but I was beaten to the punch. It shows a lot of research and should definitely be the accepted answer.

If you want confirmation of that answer (or just want to test it in a different implementation, such as a non-CPython one, or a later version that may use a different version of the Unicode standard under the covers), the following short program prints the actual characters that cause a split when using .split() with no arguments.

It does this by constructing a string with the a and b characters(a) separated by the character being tested, then checking whether split() returns a list with more than one element:

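int_ch = 0
while True:
    try:
        # chr() raises ValueError once int_ch passes the last code point.
        test_str = "a" + chr(int_ch) + "b"
    except ValueError as e:
        print(f'Stopping, {e}')
        break
    # More than one element means the character acted as a separator.
    if len(test_str.split()) != 1:
        print(f'0x{int_ch:06x} ({int_ch})')
    int_ch += 1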

The output (for my system) is as follows:

0x000009 (9)
0x00000a (10)
0x00000b (11)
0x00000c (12)
0x00000d (13)
0x00001c (28)
0x00001d (29)
0x00001e (30)
0x00001f (31)
0x000020 (32)
0x000085 (133)
0x0000a0 (160)
0x001680 (5760)
0x002000 (8192)
0x002001 (8193)
0x002002 (8194)
0x002003 (8195)
0x002004 (8196)
0x002005 (8197)
0x002006 (8198)
0x002007 (8199)
0x002008 (8200)
0x002009 (8201)
0x00200a (8202)
0x002028 (8232)
0x002029 (8233)
0x00202f (8239)
0x00205f (8287)
0x003000 (12288)
Stopping, chr() arg not in range(0x110000)

You can ignore the error at the end; it only confirms that the loop doesn't fail until it has moved past the valid Unicode range (code points 0x000000 through 0x10ffff, which make up the seventeen planes).


(a) I'm hoping that no future version of Python ever considers a or b to be whitespace, as that would totally break this (and a lot of other) code.

I think the chances of that are rather slim, so it should be fine :-)

paxdiablo
  • Thanks! I was kind of cheating in that I typed up the answer in advance; the purpose of my question was just to document my find for others. I'm wondering why I didn't think of determining it experimentally like this, so thanks and +1! – Aasmund Eldhuset May 03 '20 at 00:19
  • @Aasmund, not really "cheating" :-) Since one of the primary goals of SO is as a knowledge repo, if you have a question that's not asked already, it's considered valid to do so, and then answer it yourself. – paxdiablo May 03 '20 at 00:38
  • That was the idea :-) – Aasmund Eldhuset May 03 '20 at 01:42