
I have a Unicode string with some non-breaking spaces at the beginning and end. I get different results when using strip() vs. strip(string.whitespace).

>>> import string
>>> s5 = u'\xa0\xa0hello\xa0\xa0'
>>> print s5.strip()
hello
>>> print s5.strip(string.whitespace)
  hello  

The documentation for strip() says, "If omitted or None, the chars argument defaults to removing whitespace." The documentation for string.whitespace says, "A string containing all characters that are considered whitespace."

So if string.whitespace contains all characters that are considered whitespace, then why are the results different? Does it have something to do with Unicode?

I am using Python 2.7.6

jscs
Becca codes
  • `string.whitespace` is `" \t\n\r\x0b\x0c"` on my Python 3.2.3. Clearly Unicode is out of the picture. – Frédéric Hamidi Mar 06 '14 at 16:23
  • The documentation doesn't say that `string.whitespace` is used by `unicode.strip` to define what is and is not whitespace, however. I believe most of the `string` module is deprecated, having been folded into the `str` class itself. – chepner Mar 06 '14 at 16:42
  • @chepner apart from constants, Template, Formatter and maketrans - yup... pretty much all on the class now – Jon Clements Mar 06 '14 at 16:44

1 Answer


From the Python 3 documentation of string.whitespace:

A string containing all ASCII characters that are considered whitespace. This includes the characters space, tab, linefeed, return, formfeed, and vertical tab.

It's the same under Python 3, where all non-ASCII constants were removed. (In Python 2, some constants could be influenced by locale settings.)

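Hence the difference in behaviour: called on a unicode string without arguments, strip() removes any Unicode whitespace, while strip(string.whitespace) removes only the ASCII whitespace characters. Your string contains non-ASCII spaces (u'\xa0', the non-breaking space), so the second call leaves them in place.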
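A minimal sketch of the difference, assuming the default locale (so that string.whitespace holds only ASCII characters). The variable names other than s5 are just for illustration, and the last line shows one possible workaround rather than the only one:

```python
import string

s5 = u'\xa0\xa0hello\xa0\xa0'

# U+00A0 (NO-BREAK SPACE) counts as whitespace for unicode strings...
nbsp_is_space = u'\xa0'.isspace()                 # True
# ...but it is not one of the characters listed in string.whitespace.
nbsp_in_constant = u'\xa0' in string.whitespace   # False

# No argument: every leading/trailing Unicode whitespace character is removed.
stripped_default = s5.strip()                     # u'hello'

# Explicit argument: only the listed characters are removed, so the
# non-breaking spaces survive and the string comes back unchanged.
stripped_ascii = s5.strip(string.whitespace)

# Possible workaround: add the extra character to the set yourself.
stripped_extended = s5.strip(string.whitespace + u'\xa0')   # u'hello'
```

If you do not know in advance which non-ASCII spaces may appear, calling strip() with no argument is the simpler choice.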

Bakuriu
  • Hmmm, interesting. The documentation from Python 3.1 (the link you posted) does say "ASCII characters". The documentation from Python 2.7 on [string.whitespace](http://docs.python.org/2/library/string.html#string.whitespace) does not specifically say "ASCII characters". I wonder if it is different in Python 2.7, or if the documentation is just lacking those two words. – Becca codes Mar 06 '14 at 17:21
  • @Beccacodes As I said in python2 some of those "constants" actually depend on the locale settings. In python3 this is not true. Note that `locale` != unicode. In python3 they decided to get rid of this `locale` dependent behaviour and only kept the ASCII variants. I don't know why they didn't change the name to `ascii_whitespace`. Also note that `string.whitespace` in python2 is a *byte* string, which should already tell you that it *cannot* contain all unicode whitespace characters. – Bakuriu Mar 06 '14 at 17:23
  • I have not worked with unicode much, so I am having a bit of trouble understanding. It _sounds_ like you are saying that in Python 2.7, string.whitespace _could_ contain some non-ASCII characters, based on locale settings. But it sounds like the character u'\xa0' would still not be a possible candidate. Why? – Becca codes Mar 06 '14 at 18:30
  • @Beccacodes `string.whitespace` is a byte string, this means it cannot contain a unicode "character" or "codepoint". At most it could contain a unicode character *encoded* in some encoding. For example `u'\xa0'.encode('utf-8')` *could* be found in `string.whitespace` using some locale settings. However, on my machine even changing the locale doesn't change `string.whitespace`. – Bakuriu Mar 06 '14 at 19:46
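
To make the encoding point from the last comment concrete, here is a small sketch. It assumes the default locale, where `string.whitespace` holds only the six ASCII characters; the variable names are hypothetical:

```python
import string

nbsp = u'\xa0'  # one unicode code point: NO-BREAK SPACE

# The same character has different byte representations per encoding.
utf8_bytes = nbsp.encode('utf-8')       # two bytes: 0xC2 0xA0
latin1_bytes = nbsp.encode('latin-1')   # one byte: 0xA0

# With the default locale, every character in string.whitespace is ASCII,
# so neither encoded form of U+00A0 can appear in it.
only_ascii = all(ord(ch) < 128 for ch in string.whitespace)
```

A byte string can only ever match one of these encoded forms, never the code point itself, which is why passing `string.whitespace` to `strip()` on a unicode string leaves `u'\xa0'` untouched.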