I'm writing a crawler with Python using BeautifulSoup, and everything was going swimmingly until I ran into this site: http://www.elnorte.ec/
I'm getting the contents with the requests library:
import requests
r = requests.get('http://www.elnorte.ec/')
content = r.content  # raw bytes of the response body
If I print the content variable at that point, all the Spanish special characters appear to display fine. However, once I feed the content variable to BeautifulSoup, it all gets messed up:
soup = BeautifulSoup(content)
print(soup)
...
<a class="blogCalendarToday" href="/component/blog_calendar/?year=2011&month=08&day=27&modid=203" title="1009 artÃculos en este dÃa">
...
It's apparently garbling all the Spanish special characters (accents and whatnot). I've tried content.decode('utf-8') and content.decode('latin-1'), and I've also tried the fromEncoding parameter to BeautifulSoup, setting it to fromEncoding='utf-8' and fromEncoding='latin-1', but still no dice.
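To make the symptom concrete, here is a minimal offline sketch of how output of this shape can arise. It does not touch the network or BeautifulSoup: the `sample` string is a hypothetical stand-in for a fragment of the page, and the assumption is that the page is served as UTF-8 and gets decoded as Latin-1 somewhere along the way. I can't confirm that this is what actually happens inside the parser.

```python
# Hypothetical sample text standing in for a fragment of the page (assumption).
sample = "1009 artículos en este día"

# Encode as UTF-8, as a server sending UTF-8 would.
raw = sample.encode("utf-8")

# Decoding those bytes with the wrong codec yields the garbled form:
# each accented letter turns into "Ã" followed by a second character.
garbled = raw.decode("latin-1")

# Decoding with the matching codec round-trips cleanly.
clean = raw.decode("utf-8")

# The mistake is reversible: re-encode as Latin-1, then decode as UTF-8.
repaired = garbled.encode("latin-1").decode("utf-8")

print(repr(garbled))
print(clean)
print(repaired)
```

If the garbled text in the soup can be repaired the same way, that would suggest the bytes are UTF-8 and are being interpreted as Latin-1 at some step.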
Any pointers would be much appreciated.