
I am trying to grab an article from the web and write it to the database.

If I do this

article = article.decode('utf-8')

I get this:

'ascii' codec can't decode byte 0xc3 in position 25729: ordinal not in range(128)

If I do this

article = article.encode('utf-8')

I get this:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5409: ordinal not in range(128)

If I do this

article = article.encode('utf-8').decode()

or this

article = article.decode().encode('utf-8')

I still get this

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5409: ordinal not in range(128)

Any help in solving this problem will be greatly appreciated!

EDIT: Stack Overflow recommended an article that said to use .encode('utf-8'), as per the above; this doesn't work and the error persists.

Manish
  • Are you able to upgrade to Python 3.x? What is the type of `article`? – Simeon Visser Dec 25 '15 at 17:48
  • I could potentially upgrade to 3.x, would need to figure out how (I'm not quite a newbie but almost).... the article is a blog post about electro house music, a tips and tricks post. – Manish Dec 25 '15 at 17:54
  • I downloaded python3, and added the shebang #!/usr/bin/python3 to my script, and continue to get the error UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5409: ordinal not in range(128) – Manish Dec 25 '15 at 17:58
  • Could you try (u' ' + article).encode('utf-8') ? – Shirkrin Dec 25 '15 at 18:18
  • Tried (u' ' + article).encode('utf-8') and got the result UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 5409: ordinal not in range(128) – Manish Dec 25 '15 at 18:20
  • Or, since you're on Python 2.7/3, unicode(article), which should give you a real unicode object to work with. – Shirkrin Dec 25 '15 at 18:20
  • A link to the article might also help ;) – Shirkrin Dec 25 '15 at 18:21
  • OK, your edit clarified the case... I've reopened it. Regards. – Bhargav Rao Dec 25 '15 at 19:29
  • Not sure what you mean by "give a real unicode object" @Shirkrin, and link to article: http://www.soundstosample.com/blog/pro-tips/electro-production-tips – Manish Dec 25 '15 at 19:34
  • You need to figure out what encoding the article is in and then figure out what encoding your database API expects, and then use an appropriate method to transcode the strings. No one can give you a piece of code that just magically works universally. Hex C4 is not a valid ASCII character. – David Grayson Dec 25 '15 at 19:35
  • You can't just start throwing `encode` and `decode` around and expect things to work - you need to *understand* what you're doing. To start with, not all web pages have the same encoding so you have to deal with that, unless you're using a package that already decodes the page to Unicode strings. You need to know what you're starting with and what you need to end with! – Mark Ransom Dec 25 '15 at 19:42
  • Can someone please share how to detect the encoding of the website in the first place? I tried chardet and it's not quite working out properly... – Manish Dec 25 '15 at 19:45
  • btw, I checked the source of the website and it's charset = UTF-8. This is very frustrating.... – Manish Dec 25 '15 at 19:50
  • Possible duplicate of [How to correctly parse UTF-8 encoded HTML to Unicode strings with BeautifulSoup?](http://stackoverflow.com/questions/20205455/how-to-correctly-parse-utf-8-encoded-html-to-unicode-strings-with-beautifulsoup) – Foobie Bletch Dec 25 '15 at 20:27
  • @Manish It has not been marked as a duplicate; someone has suggested it is a duplicate. You need to slow down and do some reading. "Smart people" have figured out how to make this easy but it requires users to understand what they're doing. You need to post more code if you want help, as we don't know what kind of object `article` is. Further, please ask just one question at a time. – Alastair McCormack Dec 26 '15 at 10:07
  • Further, you have two exceptions from two different places, both without sources to back it up. Please remove the second one – Alastair McCormack Dec 26 '15 at 10:12
  • @alastair-mccormack Thanks for your feedback. To clarify, Bhargav Rao did in fact close the question after marking it as a duplicate; it was not until I sent a tweet to this user that it was reopened. I actually would like to mark my own response as the accepted answer but am precluded from doing so b/c I need to wait another 12 hours first. I will remove the second error per your advice. Have a happy holiday. – Manish Dec 27 '15 at 05:11
  • @Manish I apologise - I didn't see that it was closed. I would advise you to try glglgl or my answers before accepting your own. Your answer is a lossy fix-all conversion. Happy holidays :) – Alastair McCormack Dec 27 '15 at 09:14
  • @alastair-mccormack yes I realize my own answer is a bad fix, b/c it opens the door to new errors. Based on that, what do you recommend is the best answer from the options below? – Manish Dec 27 '15 at 19:04
  • @manish Have you tried my suggestion to use Requests to get a Unicode to pass to BeautifulSoup? – Alastair McCormack Dec 27 '15 at 21:26
  • You can't just call `.encode()`/`.decode()` on arbitrary Python objects and hope that it works. Your question is missing `type(article)` at the very least. Provide a [minimal but complete code example that shows your issue](http://stackoverflow.com/help/mcve) and the corresponding full traceback (copy-paste as is). – jfs Dec 28 '15 at 03:27

4 Answers


Unicode is not a pain if you know what you are doing.

If we try a more systematic approach, and suppose we stay with Python 2.x, we have to understand that everything we get from the web etc. consists of bytes and thus is a str.

On a str, we can only call .decode(), on a unicode object, we can only call .encode(). (This is not completely true, but if we don't follow this, we lose control over the internal de-/encoding which happens to compensate for this mismatch.)
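A minimal sketch of the intended round trip, assuming Python 2 and a page that really is UTF-8 (urllib2 and the variable names here are only for illustration):

import urllib2

raw = urllib2.urlopen(url).read()   # str: raw bytes from the web
text = raw.decode('utf-8')          # unicode: decode the bytes once, up front
data = text.encode('utf-8')         # str: encode again only where bytes are required, e.g. for the DB driver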

Example: If you do

article = article.encode('utf-8')

you get a UnicodeDecodeError which says that 'ascii' codec can't decode byte 0xc4 in position 5409: ordinal not in range(128)

We see that although we call .encode(), a decode error happens first. This is because there is an implicit call to .decode('ascii') which fails because there are non-ASCII bytes in the str.
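What this looks like in a minimal Python 2 sketch, using a made-up two-byte string:

s = '\xc4\x8d'      # a str holding the UTF-8 bytes of u'\u010d'
s.encode('utf-8')   # implicitly runs s.decode('ascii').encode('utf-8')
# UnicodeDecodeError: 'ascii' codec can't decode byte 0xc4 in position 0: ordinal not in range(128)
s.decode('utf-8')   # the correct direction: returns u'\u010d'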

However, I don't understand why

article = article.decode('utf-8')

gives a

'ascii' codec can't decode byte 0xc3 in position 25729: ordinal not in range(128)

because the ascii codec isn't used at all here. Maybe you could edit your question and add the output of print repr(article) before this .decode() call so that we can try to replicate this.
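For reference, something like this right before the failing call would tell us what we are dealing with (a quick diagnostic sketch; it assumes article is in scope at that point):

print type(article)        # str (bytes) or unicode?
print repr(article[:80])   # raw representation of the first characters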

glglgl
  • My guess is that `article` is already a Unicode object, which would give that exception. How the OP got a Unicode is anyone's guess. Let's hope the scattergun is put away now ;) – Alastair McCormack Dec 26 '15 at 11:09
  • @AlastairMcCormack Well, if it was a Unicode object, it would give an *encode* error. Anyway, you are right with the guessing. That's why I asked for the repr of that object; as soon as we have it, we'll know more. – glglgl Dec 26 '15 at 11:22

Reading between the lines of the question and the OP's own answer, it looks like the encoding of the original web page is not being handled.

The web page needs to be correctly decoded. This can be achieved by inspecting the Content-type: header or by using an HTTP library which does it for you. The Requests module does this for you and returns a decoded Unicode object. This object can then be passed to TextWrappers (via io.open()) for writing to a file, to a database handler, or to BeautifulSoup for parsing. In fact, BeautifulSoup should only be passed Unicode strings.

Example using Requests:

import requests
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4

response = requests.get(url)

# A decoded Unicode object
response_body_unicode = response.text

soup = BeautifulSoup(response_body_unicode)
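As mentioned above, the same Unicode object can also be written straight to a UTF-8 file through io.open() (a sketch; the filename is made up):

import io

with io.open('article.txt', 'w', encoding='utf-8') as f:
    f.write(response_body_unicode)   # io.open encodes the unicode object to UTF-8 on write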
Alastair McCormack
  • there is no need to use `requests` to decode an html page. [`BeautifulSoup` handles it (better) already.](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#encodings) – jfs Dec 28 '15 at 03:25
  • True, except for at least ISO-8859-15 encoded files in my small tests, where BS confused it with windows-1252. However, I guess it's more likely that BeautifulSoup will guess the right encoding than that the Content-type reported by the HTTP server is correct. – Alastair McCormack Dec 28 '15 at 20:46
  • There is no silver bullet: the server may lie about the encoding, the content may pull data from different sources that use different encodings, [smart quotes case was common](http://www.crummy.com/software/BeautifulSoup/bs4/doc/#inconsistent-encodings). How to handle cases when `response.encoding != soup.original_encoding` depends on the application. – jfs Dec 28 '15 at 20:52

It looks like the workaround is to add this to the code

html = unicode(html, errors='ignore')

so the full code to get the article looks like this

import mechanize
from bs4 import BeautifulSoup  # assuming BeautifulSoup 4
from readability.readability import Document  # assuming readability-lxml

def getArticle(url):
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [("User-agent", "Mozilla/5.0")]  # our identity on the web
    html = br.open(url).read()

    html = unicode(html, errors='ignore')

    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()

    soup = BeautifulSoup(readable_article)

    soup_title = BeautifulSoup(readable_title)

    final_article = soup.text
    final_title = soup_title.text

    links = soup.findAll('img', src=True)

    return html, final_article, final_title, links
Manish
  • That'll work for sure and also probably replace most if not all non-ASCII characters with blanks or '?' - just a side note here. If that's ok for you ignore it. – Shirkrin Dec 25 '15 at 21:26
  • @Shirkrin given that nothing else is working, not sure I have any other options... – Manish Dec 25 '15 at 22:55
  • Agreed. One other thing though - what happens if you try to insert this u'äöü' as text in your database? (if it's not the input html maybe the database is trying to force convert to ascii) – Shirkrin Dec 26 '15 at 08:50

Try using UTF-16 instead. This tends to help solve my problems when the above error occurs.

     .decode('utf-16')
J damne
  • This is just guessing. This will, of course, only work if the variable indeed has this format. – glglgl Dec 26 '15 at 10:15