
I get a link from a web page using the Beautiful Soup library, through a.get('href'). The link contains the character ®, but when I retrieve it, it comes out as Â®. How can I encode it properly? I have already added # -*- coding: utf-8 -*- at the beginning of the file.

import requests
from bs4 import BeautifulSoup

r = requests.get(url)
soup = BeautifulSoup(r.text)
  • Not enough information to answer your question. How do you tell that it "became" "Â®"? Maybe it's just your output that is wrong? – ofrommel Jul 16 '14 at 20:48
  • I get that character in the terminal when I print the string – Mazzy Jul 16 '14 at 20:51
  • How did you load the page into BeautifulSoup? It was decoded as Latin1 instead of UTF-8 somewhere. The PEP263 comment applies *only* to your source code, not to any other data you loaded. – Martijn Pieters Jul 16 '14 at 20:53
  • I use the requests object. I'm updating the code – Mazzy Jul 16 '14 at 20:55

1 Answer

Do not use r.text; leave decoding to BeautifulSoup:

soup = BeautifulSoup(r.content)

r.content gives you the response body as bytes, without decoding. r.text, on the other hand, is the response decoded to Unicode.
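
A minimal sketch of the difference, reusing the url variable from the question (the href extraction at the end is just illustrative):

import requests
from bs4 import BeautifulSoup

r = requests.get(url)

print(type(r.content))   # bytes (str on Python 2): the raw, undecoded body
print(type(r.text))      # unicode text, decoded with requests' guessed encoding

# hand the raw bytes to BeautifulSoup and let it pick the encoding itself
soup = BeautifulSoup(r.content)
link = soup.a.get('href')   # now a correctly decoded unicode string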

What happens is that the server did not include a character set in the response headers. In that case, requests follows HTTP RFC 2616, section 3.7.1: text/* responses without a declared charset default to the ISO-8859-1 (Latin-1) character set.
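
You can check what requests decided on, and override it, via the response object; a quick diagnostic sketch, assuming the server really does omit the charset parameter:

print(r.headers.get('content-type'))   # e.g. 'text/html', with no charset parameter
print(r.encoding)                      # 'ISO-8859-1', the RFC 2616 fallback
print(r.apparent_encoding)             # requests' character-detection guess, e.g. 'utf-8'

# if you know the real encoding, you can set it before touching r.text
r.encoding = 'utf-8'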

For your HTML page, that default is wrong, and you got incorrect results; r.text decoded the bytes as Latin-1, resulting in a Mojibake:

>>> print u'®'.encode('utf8').decode('latin1')
Â®
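
As a side note, text that was already mangled this way can usually be repaired by reversing the two steps, re-encoding as Latin-1 and decoding as UTF-8:

>>> mangled = u'\xc2\xae'   # the mojibake text, u'Â®'
>>> print mangled.encode('latin1').decode('utf8')
®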

The HTML page itself can declare the correct encoding, in the form of a <meta> tag in the HTML <head>. BeautifulSoup will use that tag and decode the bytes for you.
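
A small sketch of that, assuming a page that declares charset=utf-8 in a <meta> tag; soup.original_encoding is the encoding BeautifulSoup detected:

html = b'<html><head><meta charset="utf-8"></head><body>\xc2\xae</body></html>'
soup = BeautifulSoup(html)
print(soup.original_encoding)   # 'utf-8', taken from the <meta> tag
print(soup.body.string)         # the correctly decoded character, i.e. ®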

Even if the <meta> tag is missing, BeautifulSoup includes other methods to auto-detect the encoding.
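
That detection machinery is also available on its own as bs4's UnicodeDammit class; a minimal sketch:

from bs4 import UnicodeDammit

raw = b'no meta tag here: \xc2\xae'
dammit = UnicodeDammit(raw)
print(dammit.original_encoding)   # the encoding it settled on, e.g. 'utf-8'
print(dammit.unicode_markup)      # the bytes decoded using that encoding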

Martijn Pieters