Yet another encoding issue with accented characters (scraping a Website with Python and BeautifulSoup)

Question

(PREFACE: I know this problem has been discussed a hundred times, but I still don't understand it.)

I am trying to load an HTML page and output its text. I get the web page correctly, but BeautifulSoup somehow seems to destroy the encoding of accented characters, i.e. characters outside the 7-bit ASCII range (code points 0 to 127):

# -*- coding: utf-8 -*-
import sys
from urllib import urlencode
from urlparse import parse_qsl
import re
import urlparse
import json
import urllib
import urllib2  # needed for urllib2.urlopen() below
from bs4 import BeautifulSoup

url = "http://www.rtve.es/alacarta/interno/contenttable.shtml?ctx=29010&locale=es&module=&orderCriteria=DESC&pageSize=15&mode=TEXT&seasonFilter=40015"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
div = soup.find_all("span", class_="detalle")
capitulo_detalle = div[0].text  # doesn't work as intended: capitulo_detalle is meant to be a UTF-8 str, but div[0].text is of type unicode

Output of div[0].text should be something like:

Sátur se dirige al sur en busca de Estuarda y Gabi, pero un compañero de viaje inesperado hará que cambie de rumbo. Los hombres de Juan siguen presos. El enemigo comienza a realizar ejecuciones. Águila Roja tiene...

But the result I get is:

u'S\xe1tur se dirige al sur en busca de Estuarda y Gabi, pero un compa\xf1ero de viaje inesperado har\xe1 que cambie de rumbo. Los hombres de Juan siguen presos . El enemigo comienza a realizar ejecuciones. \xc1guila Roja tiene...'

--> What do I have to change to get the 'right' characters?

I know it must be a duplicate of the following questions, but the answers don't seem to work here: "Python and BeautifulSoup encoding issues" and "How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?"

I have also read the usual documentation on Unicode, UTF-8 and ASCII, e.g. https://docs.python.org/3/howto/unicode.html, obviously without success...

How do you run this code: in the Python shell or with `python script.py`? How do you get this text? Did you use `print div[0].text`, or did the Python shell print it automatically for you? You already have the correct text, but the Python shell effectively uses `print repr(div[0].text)` to show a representation that is useful for debugging. So try `print repr(div[0].text)` and `print div[0].text` and you will see different output. — furas, Feb 01 '17 at 23:54
I am using Python 2.7.13. The example can be run in the shell or as a script, it doesn't matter. And yes, `print` shows the correct output, but I need the text in a variable. — Tim Bremer, Feb 02 '17 at 08:40
You already have the right characters. The literal `u'S\xe1tur se dirige...'` exactly represents the text `Sátur se dirige...`. If you `print()` it you will see the raw characters (assuming your console can print them, which a Windows console might not be able to). — bobince, Feb 02 '17 at 09:36
@bobince: Yes, but I am using Python 2.7.13 with utf-8. If I assign div[0].text (which is unicode) to a normal string variable (which is utf-8), I run into trouble. — Tim Bremer, Feb 02 '17 at 10:09
Use only `unicode` and you will have no problem, so convert all strings to unicode. That is the solution. — furas, Feb 02 '17 at 20:18

score 0 · Accepted Answer · answered Feb 02 '17 at 22:51

I believe I finally got it...

>>> div = soup.find("span", class_="detalle")
>>> div.text
u'S\xe1tur se dirige al sur en busca de Estuarda y Gabi, pero

---> This is unicode; \xe1 is the escape for the code point U+00E1, i.e. 'á' (http://www.utf8-chartable.de/unicode-utf8-table.pl?start=4096&number=128&names=-&utf8=string-literal)

>>> print(div.text)
Sátur se dirige al sur en busca de Estuarda y Gabi, pero

---> 'print' outputs the character itself instead of the escaped code point
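
The difference between the two outputs above can be reproduced in isolation. This is only a minimal sketch and it assumes Python 3, where `str` plays the role of Python 2's `unicode` and `ascii()` produces an escaped view similar to what the Python 2 shell echoes; the shortened sample string is taken from the output above.

```python
# Python 3 sketch: `str` here corresponds to `unicode` in Python 2.
text = u"S\xe1tur se dirige al sur"  # same escape as in the shell output

# The escape \xe1 and the literal character are the same code point (U+00E1),
# so both spellings denote the same text.
same = (text == u"Sátur se dirige al sur")

# ascii() gives the escaped, debugging-oriented view (comparable to what the
# Python 2 shell echoes); str() is the plain text that print would write.
escaped_view = ascii(text)
readable_view = str(text)

print(same)          # True
print(escaped_view)  # 'S\xe1tur se dirige al sur'
```

So the variable already holds the right characters; only the way the shell displays it differs.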

>>> div.text.encode('utf-8')
'S\xc3\xa1tur se dirige al sur en busca de Estuarda y Gabi, pero

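---> The unicode text is encoded to UTF-8 according to the table at the URL cited above. I didn't understand why the output is shown as \xc3\xa1 and not as 'á'. (The shell echoes the repr of the byte string, which escapes every byte outside printable ASCII.)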

>>> print div.text.encode('utf-8')
S├ítur se dirige al sur en busca de Estuarda y Gabi, pero

---> And I didn't understand why print now shows strange symbols...
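
In hindsight the strange symbols can be explained: a console working in code page 850 interprets each of the two UTF-8 bytes as a separate character. The following is only a sketch, written for Python 3 (where byte strings are `bytes`), and it simulates the console by decoding with `cp850` explicitly; the variable names are made up for illustration.

```python
text = u"S\xe1tur"

utf8_bytes = text.encode("utf-8")    # b'S\xc3\xa1tur': 'á' becomes two bytes
cp850_bytes = text.encode("cp850")   # b'S\xa0tur': 'á' is a single byte in cp850

# What a cp850 console effectively does with the UTF-8 bytes:
# 0xC3 and 0xA1 are read as two unrelated cp850 characters.
mojibake = utf8_bytes.decode("cp850")

# What it does with bytes that really are cp850:
correct = cp850_bytes.decode("cp850")

print(ascii(mojibake))  # 'S\u251c\xedtur', i.e. a box-drawing character plus 'í'
print(correct == text)  # True
```

The bytes are not damaged in either case; they are only read back with a different encoding than the one used to write them.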

>>> blurr = div.text.encode('cp850')
>>> blurr
'S\xa0tur se dirige al sur en busca de Estuarda y Gabi, pero
>>> type(blurr)
<type 'str'>

---> The unicode text is now encoded to code page 850, which the Python shell uses under Windows

>>> print(blurr)
Sátur se dirige al sur en busca de Estuarda y Gabi, pero

---> Finally, it's right !!!

In Kodi I can use the UTF-8 representation, so that e.g. the character 'á' is stored in the variable as \xc3\xa1, but when the content of the variable is displayed, for example with `xbmcgui.Dialog().ok(addonname, blurr)`, it is shown correctly on the screen as 'á'...
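
One way to keep this manageable is to hold the text as unicode everywhere and encode only at the point where it leaves the program. The sketch below assumes Python 3, and `encode_for` is a hypothetical helper made up for illustration, not part of Kodi, BeautifulSoup or the standard library; the `"replace"` error handler is the standard one accepted by `encode()`.

```python
def encode_for(text, encoding):
    """Hypothetical helper: encode text only at the output boundary.

    Characters the target encoding cannot represent are replaced by '?'
    instead of raising UnicodeEncodeError.
    """
    return text.encode(encoding, "replace")


text = u"S\xe1tur se dirige al sur"  # keep this form in variables

for_utf8_consumer = encode_for(text, "utf-8")  # e.g. an API expecting UTF-8 bytes
for_cp850_console = encode_for(text, "cp850")  # e.g. a Windows console on cp850

# Decoding with the matching encoding gives the original text back.
roundtrip = for_utf8_consumer.decode("utf-8")

# The euro sign does not exist in cp850, so it is replaced.
lossy = encode_for(u"\u20ac", "cp850")

print(roundtrip == text)  # True
print(lossy)              # b'?'
```

Which encoding the consumer expects is the only thing that changes between the Kodi case and the console case; the text in the variable stays the same.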

And you're just supposed to know this kind of thing...

score -1 · Answer 2 · answered Feb 02 '17 at 07:29

import requests
from bs4 import BeautifulSoup

url = "http://www.rtve.es/alacarta/interno/contenttable.shtml?ctx=29010&locale=es&module=&orderCriteria=DESC&pageSize=15&mode=TEXT&seasonFilter=40015"
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
div = soup.find("span", class_="detalle")
capitulo_detalle = div.text

out:

'Sátur se dirige al sur en busca de Estuarda y Gabi, pero un compañero de viaje inesperado hará que cambie de rumbo. Los hombres de Juan siguen presos. El enemigo comienza a realizar ejecuciones. Águila Roja tiene...'

Use requests and Python 3, and the problem will never show up.

I also tried the example with 'requests', which doesn't change anything. Well, `python3` handles unicode a priori, but now I started with python 2.7.13 and I don't want to change the whole code, looking for inconsistencies. Furthermore I don't know if Kodi supports python3. — Tim Bremer, Feb 02 '17 at 08:42
