Python 3 & BeautifulSoup Issue - [Decode error - output not utf-8]

Question

I'm having a problem with BeautifulSoup.

I'm trying to read the titles of a couple of websites, and when my code reads a site whose title contains Latin characters I get this error:

[Decode error - output not utf-8]

Does anyone know how to solve this?

Cheers.

My code:

import urllib.request
from bs4 import BeautifulSoup

def getTitle(theList):
    # Fetch each URL and print its <title> text
    for element in theList:
        response = urllib.request.urlopen(element)
        soup = BeautifulSoup(response.read())
        title = soup.find("title").text
        print(element, ": ", title, "\n")

score 1 · Answer 1 · edited May 23 '17 at 12:20

Try: How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?

It suggests soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
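To see what the 'ignore' argument in that suggestion actually does, here is a minimal sketch using only bytes and the standard library. The helper name decode_lenient and the sample bytes are made up for illustration; the point is that 'ignore' silently drops bytes that are not valid UTF-8, so accented characters encoded some other way vanish rather than being converted.

```python
def decode_lenient(raw: bytes, errors: str = "ignore") -> str:
    """Hypothetical helper: decode bytes as UTF-8, handling bad bytes per `errors`."""
    return raw.decode("utf-8", errors)

# "é" encoded as Latin-1 is the single byte 0xE9, which is not valid UTF-8 here.
mixed = b"Caf\xe9 bar"

dropped = decode_lenient(mixed)                      # invalid byte is removed
marked = decode_lenient(mixed, "replace")            # invalid byte becomes U+FFFD
clean = decode_lenient("Café bar".encode("utf-8"))   # valid UTF-8 is left alone

print(repr(dropped), repr(marked), repr(clean))
```

If the page is not really UTF-8, this avoids an exception during decoding but loses characters, so it may be worth checking the result before relying on it.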

answered Sep 03 '14 at 22:50

Celeo

Hi Celeo, thanks for the answer. I already tried that, but I still get the same error. Any thoughts? – Alex.Six Sep 03 '14 at 23:01

score 0 · Answer 2 · answered Sep 03 '14 at 23:08

If the error is due to inconsistent encodings, i.e., the HTML is mostly UTF-8 (and BeautifulSoup detects it as such) but some characters (most notably Microsoft smart quotes) are in a different encoding, try the UnicodeDammit.detwingle() method. It makes the document pure UTF-8 without discarding the mis-encoded characters that can be salvaged.
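A minimal sketch of that approach, assuming bs4 is installed. The sample bytes are invented for illustration: a UTF-8 title with Windows-1252 smart quotes mixed in. detwingle() is called on the raw bytes first, and only the cleaned result is decoded and handed to BeautifulSoup.

```python
from bs4 import BeautifulSoup, UnicodeDammit

# Hypothetical sample: UTF-8 text with Windows-1252 smart quotes embedded.
head = "<title>Café ".encode("utf-8")
quoted = "\u201cmenu\u201d</title>".encode("windows-1252")
raw = head + quoted  # mixed encodings; not valid UTF-8 as a whole

# Convert the stray Windows-1252 bytes to UTF-8; the result is still bytes.
fixed = UnicodeDammit.detwingle(raw)

soup = BeautifulSoup(fixed.decode("utf-8"), "html.parser")
print(soup.title.text)
```

In the question's loop, raw would be response.read() instead of the hand-built bytes. Note that detwingle() only handles Windows-1252 embedded in UTF-8 (or the reverse), so it will not help if the page is in some other encoding altogether.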

jfs

Sebastian, I don't get it. Could you be more specific? Should I call UnicodeDammit.detwingle() inside BeautifulSoup(...)? – Alex.Six Sep 03 '14 at 23:40
@user3120065: click the link. It explicitly says when you should call `.detwingle()` – jfs Sep 03 '14 at 23:44
