
Here is what you can find in the str.strip documentation:

The chars argument is a string specifying the set of characters to be removed. If omitted or None, the chars argument defaults to removing whitespace.

Now my question is: which specific characters are considered whitespace?

These function calls share the same result:

>>> ' '.strip()
''
>>> '\n'.strip()
''
>>> '\r'.strip()
''
>>> '\v'.strip()
''
>>> '\x1e'.strip()
''

In this related question, a user mentioned that str.strip works with a superset of the ASCII whitespace characters (in other words, a superset of string.whitespace). More specifically, it works with all Unicode whitespace characters.

Moreover, I believe (but I'm just guessing, I have no proof) that c.isspace() returns True for each character c that would also be removed by str.strip. Is that correct? If so, I guess one could just run c.isspace() for each Unicode character c and come up with a list of the whitespace characters that str.strip removes by default.

>>> ' '.isspace()
True
>>> '\n'.isspace()
True
>>> '\r'.isspace()
True
>>> '\v'.isspace()
True
>>> '\x1e'.isspace()
True
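
As a rough sketch of that idea, the candidate list could be collected as below. It assumes that `sys.maxunicode` is the highest valid code point (0x10FFFF in Python 3); the name `space_chars` is just a placeholder chosen for this example.

```python
import sys

# Collect every character whose one-character string reports isspace() == True.
# Assumption: sys.maxunicode is the last code point that chr() accepts.
space_chars = [
    chr(cp) for cp in range(sys.maxunicode + 1) if chr(cp).isspace()
]

print(len(space_chars), "characters report isspace() == True")
print([hex(ord(c)) for c in space_chars])
```

This only lists what isspace() reports; whether str.strip removes exactly the same characters is the open question.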

Is my assumption correct? And if so, how can I prove it? Is there an easier way to know which specific characters are automatically removed by str.strip?

Riccardo Bucco
  • 4
    See the answer for [`str.split()`](https://stackoverflow.com/questions/61566711/which-characters-are-considered-whitespace-by-split); the definition of whitespace is the same there. – Cory Kramer Sep 09 '22 at 12:06
  • 1
    @CoryKramer `'\x1d'.strip() == ''`, but `0x1d` (the group separator character) is not mentioned in the answer. Maybe that list is not completely updated? – Riccardo Bucco Sep 09 '22 at 12:13
  • 1
    Related: https://bugs.python.org/issue25433 ; It looks like CPython's source code for str.split, str.strip and str.isspace all relies on a macro called `Py_UNICODE_ISSPACE`, which is defined here: [cpython/unicodeobject.h#L902](https://github.com/python/cpython/blob/main/Include/cpython/unicodeobject.h#L902) – Stef Sep 13 '22 at 13:32

1 Answer


The simplest way to find out which characters str.strip() removes is to loop over every possible character and check whether a string containing only that character is altered by str.strip():

import sys

# Try every code point and report the ones that strip() removes.
for c in range(sys.maxunicode + 1):
  s = chr(c)
  if s != s.strip():
    print(f"{hex(c)} is stripped", flush=True)

As suggested in the comments, you can also print a table to check whether str.strip(), str.split() and str.isspace() treat whitespace the same way:

import sys

print("char\tstrip\tsplit\tisspace")
for c in range(sys.maxunicode + 1):
  s = chr(c)
  stripped = s != s.strip()  # strip() removed the character
  splitted = not s.split()   # split() treated it as a separator
  spaced = s.isspace()
  if stripped or splitted or spaced:
    print(f"{hex(c)}\t{stripped}\t{splitted}\t{spaced}", flush=True)

If I run the code above I get:

char    strip   split   isspace
0x9     True    True    True
0xa     True    True    True
0xb     True    True    True
0xc     True    True    True
0xd     True    True    True
0x1c    True    True    True
0x1d    True    True    True
0x1e    True    True    True
0x1f    True    True    True
0x20    True    True    True
0x85    True    True    True
0xa0    True    True    True
0x1680  True    True    True
0x2000  True    True    True
0x2001  True    True    True
0x2002  True    True    True
0x2003  True    True    True
0x2004  True    True    True
0x2005  True    True    True
0x2006  True    True    True
0x2007  True    True    True
0x2008  True    True    True
0x2009  True    True    True
0x200a  True    True    True
0x2028  True    True    True
0x2029  True    True    True
0x202f  True    True    True
0x205f  True    True    True
0x3000  True    True    True

So, at least in Python 3.10.4, your assumption seems to be correct.
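
To compare the three methods without reading the table row by row, the sets can be built and compared directly. This is only a sketch: `all_chars`, `stripped`, `split_away` and `spaced` are names made up for this example, and the listing may differ between Python versions because it depends on the Unicode database bundled with the interpreter. The general category and bidirectional class are printed because the documentation for str.isspace() describes whitespace in terms of those two properties (general category `Zs`, or bidirectional class `WS`, `B` or `S`).

```python
import sys
import unicodedata


def all_chars():
    """Yield every one-character string that chr() can produce."""
    return map(chr, range(sys.maxunicode + 1))


# Characters removed by strip(), swallowed as separators by split(),
# and reported as whitespace by isspace().
stripped = {c for c in all_chars() if c.strip() != c}
split_away = {c for c in all_chars() if not c.split()}
spaced = {c for c in all_chars() if c.isspace()}

print("strip() agrees with isspace():", stripped == spaced)
print("split() agrees with isspace():", split_away == spaced)

# Show the Unicode properties of each whitespace character.
# Control characters have no name, so a placeholder is used for them.
for c in sorted(spaced):
    print(
        f"U+{ord(c):04X}",
        unicodedata.category(c),
        unicodedata.bidirectional(c),
        unicodedata.name(c, "<no name>"),
        sep="\t",
    )
```

If both comparisons print True, the three methods agree on that interpreter. That is an empirical check rather than a proof; the comment above pointing at the `Py_UNICODE_ISSPACE` macro is the place to look for the reason why.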

etuardu