
I have been plugging away at this for hours and I just can't seem to get to the bottom of it. I have been through this website in detail, and although others seem to have a similar problem, the solutions given just don't work for me.

I have a Python script which reads the HTML of a website and uses Beautiful Soup to find things like the head, body, H1s, etc., and then stores them in a UTF-8 MySQL table.

Seems straightforward, but I keep running into:

UnicodeDecodeError: 'ascii' codec can't decode byte xxxxxx

It happens when I encode. I have tried everything I can find to stop this happening, but to no avail. Here is one version of the code:

soup = BeautifulSoup(strIndexPage)
strIndexPageBody = str(soup.body)
strIndexPageBody = strIndexPageBody.encode('ascii', 'ignore') # I know ignore is not best practice but I am really not interested in anything outside the ascii character set
strIndexPageBody = strIndexPageBody.replace('"', '&quot;')
strIndexPageBody = strIndexPageBody.replace("'", "&rsquo")

An earlier version where I tried to convert to utf-8 works better, but I end up with the

` 

character present in some of the HTML, which breaks the MySQL insert/update. Obviously I have tried searching for this character and replacing it, but then Python tells me I have a non-ASCII character in my code!

I have read tons of articles that say I should look at the encoding of the HTML first, decode, and then encode to suit, but the encoding does not always come back from BS, and/or it is not declared within the HTML.

I am sure there is a simple way around this but I can't find it.

Thanks for any help.

dan360
  • Shouldn't `&rsquo` end in a semicolon? Also it's not the same as `'`. – Mark Byers Nov 10 '11 at 23:22
  • Please stop focusing on the last two lines - they are not where the error is. It errors on the encoding as the error message suggests. – dan360 Nov 10 '11 at 23:26
  • When Python complains about a non-ascii character in your code, it probably means you need to add a `# coding: utf-8` magic comment at the top (it needs to be one of the first two lines). That's assuming you're saving the Python file in UTF-8. – Thomas K Nov 10 '11 at 23:45
  • Very similar to http://stackoverflow.com/questions/5236437/python-unicodeencodeerror-how-can-i-simply-remove-troubling-unicode-characters – Mark Byers Nov 10 '11 at 23:46
  • Interesting - will give it a shot tomorrow - thanks for your input. – dan360 Nov 10 '11 at 23:47
  • Upvote for Mark Byers who also helped. Thank you. – dan360 Nov 11 '11 at 11:48

2 Answers


Note that you're getting a decode error from a call to encode. This is the ugliest part of Python 2: it lets you try to encode a string that is already encoded, by first decoding it as ascii. What you're doing is equivalent to this:

s.decode('ascii', 'strict').encode('ascii', 'ignore')

I think this should do what you expect:

soup = BeautifulSoup(strIndexPage)
strIndexPageBody = unicode(soup.body)
strIndexPageBody = strIndexPageBody.encode('ascii', 'ignore')

Note that we're calling unicode, so we get a unicode string that we can validly try to encode.
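The implicit step described above can be made visible by writing it out by hand. The following is only a sketch: the sample markup is made up, UTF-8 is assumed as the page's real encoding, and the byte string merely stands in for what `str(soup.body)` would return. It is written so that it also runs on Python 3, where the decode has to be spelled out explicitly.

```python
# -*- coding: utf-8 -*-
# Made-up sample: a byte string standing in for str(soup.body).
raw = u"<body>caf\u00e9 \u2019quoted\u2019</body>".encode("utf-8")

# Python 2 performs this decode implicitly when .encode() is called
# on a byte string; writing it out shows where the error comes from.
try:
    raw.decode("ascii", "strict")
    implicit_decode_failed = False
except UnicodeDecodeError:
    implicit_decode_failed = True

# Decode with the real encoding first (assumed to be UTF-8 here),
# then drop every character outside the ASCII range.
text = raw.decode("utf-8")
ascii_only = text.encode("ascii", "ignore")
```

Here the strict ASCII decode fails on the first non-ASCII byte, while the decode-then-encode path succeeds and simply omits the accented letter and the curly quotes.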

Thomas K
  • If `soup.body` is a `str` object which encodes non-ascii characters, then passing it to `unicode` will give a `UnicodeDecodeError`; on the other hand, if it's already a unicode object, then passing it to `unicode` is redundant. – ekhumoro Nov 11 '11 at 00:19
  • @ekhumoro: It's a `BeautifulSoup.Tag` object, which can be flattened with either `str` or `unicode`. – Thomas K Nov 11 '11 at 00:24
  • Yeah, sorry - I probably should have guessed that :/ – ekhumoro Nov 11 '11 at 00:55
  • @Thomas K: Thank you very much for your help - Your explanation is succinct and I now understand where I was going wrong. – dan360 Nov 11 '11 at 11:47

BeautifulSoup's UnicodeDammit should be able to detect the encoding of a document even when it isn't specified.

What happens when you run this on the page in question?

from BeautifulSoup import UnicodeDammit

UnicodeDammit(html_string).unicode

What specific line of code is throwing the error and can we have a sample of problematic HTML?
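If it helps to see the general idea without the library, here is a rough stand-in, not UnicodeDammit itself: it just tries a short list of candidate encodings in order and keeps the first one that decodes cleanly. The function name `decode_with_fallback`, the candidate list, and the sample bytes are all assumptions made for illustration.

```python
def decode_with_fallback(data, candidates=("utf-8", "windows-1252")):
    """Return (text, encoding) using the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: latin-1 maps every byte to a code point, so it cannot fail.
    return data.decode("latin-1"), "latin-1"


# Hypothetical page bytes saved as windows-1252 with no declared charset.
page = u"<h1>It\u2019s here</h1>".encode("windows-1252")
text, used = decode_with_fallback(page)
```

Once `text` is a unicode string, encoding it with `.encode('ascii', 'ignore')` no longer triggers an implicit ASCII decode.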

Acorn
  • I skimmed over that earlier - I will give it a try and report back - thanks for your help. – dan360 Nov 10 '11 at 23:05
  • The thing is, UnicodeDammit is used by default when parsing a page with BeautifulSoup, so you shouldn't have to do anything special. – Acorn Nov 10 '11 at 23:06
  • I see - BS does not error - the error occurs when I try to encode it. – dan360 Nov 10 '11 at 23:09
  • If you're encoding unicode to ascii, and you're setting it to ignore characters that can't be encoded, it shouldn't be raising `UnicodeDecodeError` exceptions. What is the line of code that raises the exception and what is the object being encoded? – Acorn Nov 11 '11 at 00:07