I'm having some problems with BeautifulSoup.

I'm trying to read the titles of a few websites, and when my code reads a site whose title contains Latin characters, I get this error:

[Decode error - output not utf-8]

Does anyone know how to solve this?

Cheers.

My code:

import urllib.request

from bs4 import BeautifulSoup


def getTitle(theList):
    for element in theList:
        response = urllib.request.urlopen(element)
        soup = BeautifulSoup(response.read())
        title = soup.find("title").text
        print(element, ": ", title, "\n")
Alex.Six

2 Answers

See the related question: How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

It suggests decoding the response yourself before handing it to BeautifulSoup: soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
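
Here is a minimal sketch of that suggestion, separated from the network call so it can be tried on raw bytes. The names decode_html and get_title are made up for this illustration, and the "html.parser" argument is my assumption (the snippet in the question lets BeautifulSoup pick a parser). Keep in mind that 'ignore' silently drops any bytes that are not valid UTF-8, so some characters may go missing.

```python
from bs4 import BeautifulSoup


def decode_html(raw: bytes) -> str:
    """Decode page bytes as UTF-8, dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", "ignore")


def get_title(raw: bytes) -> str:
    """Return the text of the <title> tag, or "" if the page has none."""
    # "html.parser" is an assumption; any installed parser should work here.
    soup = BeautifulSoup(decode_html(raw), "html.parser")
    return soup.title.get_text() if soup.title else ""
```

With the code from the question, this would be called as get_title(response.read()).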

Community
Celeo

If the error is due to inconsistent encodings, i.e. the HTML is mostly UTF-8 (and BeautifulSoup detects it as such) but some characters (most notably Microsoft smart quotes) are in a different encoding, try the UnicodeDammit.detwingle() method. It makes the document pure UTF-8 without discarding the incorrectly encoded characters that can be salvaged.
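
A rough sketch of how that could look, assuming the stray bytes are Windows-1252 (the case detwingle() is meant for). The function name title_from_mixed_bytes and the "html.parser" choice are mine, for illustration only. The important part is that detwingle() runs on the raw bytes before they reach BeautifulSoup.

```python
from bs4 import BeautifulSoup, UnicodeDammit


def title_from_mixed_bytes(raw: bytes) -> str:
    """Parse a page that is mostly UTF-8 with some Windows-1252 bytes mixed in."""
    # detwingle() takes bytes and returns bytes: embedded Windows-1252
    # characters are re-encoded so the whole document becomes UTF-8.
    fixed = UnicodeDammit.detwingle(raw)
    # Strict decoding here will still raise if other invalid bytes remain.
    soup = BeautifulSoup(fixed.decode("utf-8"), "html.parser")
    return soup.title.get_text() if soup.title else ""
```

In the question's loop this would be called as title_from_mixed_bytes(response.read()).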

jfs
  • Sebastian, I don't get it. Could you be more specific? Should I use UnicodeDammit.detwingle() inside BeautifulSoup(...)? – Alex.Six Sep 03 '14 at 23:40
  • @user3120065: click the link. It explicitly says when you should call `.detwingle()` – jfs Sep 03 '14 at 23:44