(PREFACE: I know this problem has been discussed a hundred times, but I still don't understand it.)
I am trying to load an HTML page and output its text. I am getting the web page correctly, but BeautifulSoup somehow destroys the encoding of accented characters, which lie outside the 7-bit ASCII range (the first 128 characters):
# -*- coding: utf-8 -*-
import sys
from urllib import urlencode
from urlparse import parse_qsl
import re
import urlparse
import json
import urllib
import urllib2  # needed for urllib2.urlopen() below
from bs4 import BeautifulSoup
url = "http://www.rtve.es/alacarta/interno/contenttable.shtml?ctx=29010&locale=es&module=&orderCriteria=DESC&pageSize=15&mode=TEXT&seasonFilter=40015"
html = urllib2.urlopen(url).read()
soup = BeautifulSoup(html)
div = soup.find_all("span", class_="detalle")
capitulo_detalle = div[0].text  # doesn't work: capitulo_detalle is meant to be a UTF-8 str, but div[0].text is of type unicode
The output of div[0].text should be something like:
Sátur se dirige al sur en busca de Estuarda y Gabi, pero un compañero de viaje inesperado hará que cambie de rumbo. Los hombres de Juan siguen presos. El enemigo comienza a realizar ejecuciones. Águila Roja tiene...
But the result I get is:
u'S\xe1tur se dirige al sur en busca de Estuarda y Gabi, pero un compa\xf1ero de viaje inesperado har\xe1 que cambie de rumbo. Los hombres de Juan siguen presos . El enemigo comienza a realizar ejecuciones. \xc1guila Roja tiene...'
--> What do I have to change to get the 'right' characters?
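For reference, the u'...' form shown above looks like the escaped representation (repr) that Python 2 displays for a unicode object, not like corrupted data. The following is only a minimal sketch under that assumption: `text` is a shortened, hand-copied stand-in for div[0].text, and the names `expected`, `strings_match` and `utf8_bytes` are made up for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Hypothetical stand-in for div[0].text: a shortened copy of the output above.
text = u"S\xe1tur se dirige al sur, pero un compa\xf1ero de viaje inesperado..."

# The same sentence typed with literal accented characters.
expected = u"Sátur se dirige al sur, pero un compañero de viaje inesperado..."

# \xe1 and á denote the same code point, so the two strings compare equal.
strings_match = (text == expected)

# Encoding yields UTF-8 bytes (type str in Python 2, bytes in Python 3).
utf8_bytes = text.encode("utf-8")

print(strings_match)
# The byte string is longer because each accented letter takes two bytes.
print(len(text), len(utf8_bytes))
```

In a Python 2 interactive session, evaluating a unicode object echoes its escaped repr, whereas `print text` writes the characters themselves, provided the terminal encoding can represent them.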
I know this must be a duplicate of the following questions, but their answers don't seem to work here: "Python and BeautifulSoup encoding issues" and "How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?"
I also read the usual documentation about Unicode, UTF-8, and ASCII, e.g. https://docs.python.org/3/howto/unicode.html, obviously without success...