
I have this code fragment (Python 2.7):

from bs4 import BeautifulSoup

content = '&nbsp; foo bar'
soup = BeautifulSoup(content, 'html.parser')
w = soup.get_text()

At this point w has a byte with value 160 in it, but its encoding is ASCII.

How do I replace all of the \xa0 bytes with another character?

I've tried:

w = w.replace(chr(160), ' ')
w = w.replace('\xa0', ' ')

but I am getting the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 0: ordinal not in range(128)

And why does BS return an ASCII encoded string with an invalid character in it?

Is there a way to convert w to a 'latin1' encoded string?

pedromendessk
ErikR

1 Answer


At this point w has a byte with value 160 in it, but its encoding is 'ascii'.

You have a unicode string, not an encoded byte string:

>>> w
u'\xa0 foo bar'
>>> type(w)
<type 'unicode'>

How do I replace all of the \xa0 bytes with another character?

>>> x = w.replace(u'\xa0', ' ')
>>> x
u'  foo bar'
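The error in the question comes from the argument, not from w: when a byte string such as '\xa0' is passed to a unicode method, Python 2 first decodes it with the ascii codec, and that decode is what fails. Here is a minimal sketch that performs that step explicitly, written so that it runs on both Python 2 and 3. The variable names are only illustrative, and w is hard-coded to the value shown above rather than obtained from BeautifulSoup.

```python
# w stands in for the value soup.get_text() returned in the question.
w = u'\xa0 foo bar'

# Python 2 does this decode implicitly for w.replace('\xa0', ' ').
# Doing it by hand shows where the UnicodeDecodeError comes from.
try:
    b'\xa0'.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Passing unicode arguments avoids the implicit decode entirely.
x = w.replace(u'\xa0', u' ')
```

Using unicode literals for both arguments keeps every operand the same type, so no codec is involved at all.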

And why does BS return an 'ascii' encoded string with an invalid character in it?

As mentioned above, it is not an ascii encoded string, but a unicode object.

Is there a way to convert w to a 'latin1' encoded string?

Sure:

>>> w.encode('latin1')
'\xa0 foo bar'

(Note that this last string is an encoded byte string, not a unicode object, so its representation is not prefixed by 'u' like the previous unicode objects.)
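To make the difference between the unicode object and its encoded forms concrete, here is a small sketch that encodes the same text as latin1 and as UTF-8 and compares the resulting bytes. It assumes w holds the value shown earlier; the variable names are illustrative, and the code runs on both Python 2 and 3.

```python
w = u'\xa0 foo bar'

# latin1 maps U+00A0 to the single byte 0xA0.
latin1_bytes = w.encode('latin1')

# UTF-8 needs two bytes (0xC2 0xA0) for the same code point.
utf8_bytes = w.encode('utf8')

# Decoding with the matching codec gives back the original text.
roundtrip = latin1_bytes.decode('latin1')
```

The text is the same in both cases; only the byte representation differs, which is why you need to know the codec when decoding.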

Notes (edited):

  • If you are typing string literals into your source files, note that the encoding of the source file matters. Python 2 assumes your source files are ASCII unless told otherwise. The command line interpreter, on the other hand, assumes you are entering strings in your default system encoding. You can override both, for example with a coding declaration at the top of the source file.
  • Avoid latin1 and use UTF-8 if possible, i.e. w.encode('utf8').
  • When encoding and decoding, you can tell Python to ignore errors, or to replace characters that cannot be encoded with a marker character. I don't recommend ignoring encoding errors (at least not without logging them), except in the hopefully rare cases where you know the data contains encoding errors, or where you must encode text into a smaller character set and have to replace the code points that cannot be represented (e.g. if you need to encode 'España' into ASCII, you have to replace the 'ñ'). Even for these cases there are, in my opinion, better alternatives, and you should look into the magical unicodedata module (see https://stackoverflow.com/a/1207479/401656).
  • There is a Python Unicode HOWTO: https://docs.python.org/2/howto/unicode.html
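As a rough illustration of the error-handling options in the notes above, this sketch encodes 'España' into ASCII three ways: ignoring the unencodable character, replacing it with a marker, and stripping the accent with unicodedata. The accent-stripping step is one common approach, not necessarily the exact one in the linked answer, and the variable names are illustrative. It runs on both Python 2 and 3.

```python
import unicodedata

text = u'Espa\xf1a'  # 'España' written with an escape, so the source file stays ASCII

# 'ignore' silently drops characters that ASCII cannot represent.
ignored = text.encode('ascii', 'ignore')

# 'replace' substitutes a '?' marker for each such character.
replaced = text.encode('ascii', 'replace')

# NFKD splits 'ñ' into 'n' plus a combining tilde (U+0303);
# dropping the combining marks leaves plain ASCII letters.
decomposed = unicodedata.normalize('NFKD', text)
stripped = u''.join(c for c in decomposed if not unicodedata.combining(c))
ascii_bytes = stripped.encode('ascii')
```

Only the last variant keeps the word readable, which is the reason to prefer it when the target really must be ASCII.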
Community
jjmontes
  • So the error message is not about `w`, but about `chr(160)` and `'\xa0'` - those are the strings that the ascii codec cannot handle. Is that right? – ErikR Sep 22 '15 at 21:12
  • Exactly. BTW, I added a lot of the information to my answer. – jjmontes Sep 22 '15 at 21:15
  • Thanks for the help. Honestly, I didn't find the Unicode HOWTO very helpful in answering my conceptual questions. Python Unicode support is similar to other languages. I wrote up another way of explaining it here: [(link)](https://gist.github.com/erantapaa/c6d7284e23c86f7a50d4) Comments are welcome. – ErikR Sep 23 '15 at 16:13
  • I enjoyed your article, and I also think the Unicode HOWTO needs a lot of improvement. I too once suffered from the difference between `str` and `unicode`, and wish I had had your article then. Another thing that drove me crazy is that both `str` and `unicode` have `encode()` and `decode()` methods with slightly different semantics; it was a gotcha for me. But Unicode matters will always have a steep learning curve. I can also recommend Joel's article: "[Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html)" – jjmontes Sep 23 '15 at 21:36