Python UnicodeEncodeError / Wikipedia-API

Question

I am trying to parse this document with Python and BeautifulSoup:

http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=rage_against_the_machine

The seventh Item down has this Text tag:

Rage Against the Machine's 1994–1995 Tour

When I try to print out the text "Rage Against the Machine's 1994–1995 Tour", Python gives me this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 31: ordinal not in range(128)

I can resolve it by simply replacing u'\u2013' with '-' like so:

itemText = itemText.replace(u'\u2013', '-')

However, what about every character that I have not coded for? I do not want to ignore them, nor do I want to list out every possible find and replace.

Surely a library must exist that tries its very best to detect the encoding from a list of common known encodings (however likely it is to get it wrong).

someText = getTextWithUnknownEncoding(someLocation);
bestAsciiAttemptText = someLibrary.tryYourBestToConvertToAscii(someText)

Thank you

*How* are you printing the text? To a terminal, to a file? Are you concatenating (str1 + str2) anywhere? — Martijn Pieters, Nov 17 '12 at 17:38
Does [How can I display native accents to languages in console in windows?](http://stackoverflow.com/q/3473166) help? — Martijn Pieters, Nov 17 '12 at 17:46
I am on Windows 7 and I am printing directly to the terminal — szxnyc, Nov 17 '12 at 21:22
That link suggests to use .encode('utf-8') which does not work. It gives me the same exact error. — szxnyc, Nov 17 '12 at 21:40
It also tells you that printing UTF-8 to the windows console is tricky. Did you follow the rest of the instructions? — Martijn Pieters, Nov 17 '12 at 21:41
Possible duplicate of [How can I display native accents to languages in console in windows?](http://stackoverflow.com/q/3473166) — Martijn Pieters, Nov 17 '12 at 21:51
Setting my console to use utf-8 via chcp 65001 or setting the font to Lucida Console does not change the behavior. I still get the error. — szxnyc, Nov 17 '12 at 21:54
Not having Windows myself, I have no further hints for you on how to solve this; all I know is that getting UTF-8 to work in Windows consoles is needlessly difficult, and the linked answer is the only information I have for you. — Martijn Pieters, Nov 17 '12 at 22:00
Thank you for all the suggestions Martijn. I'll keep trying and hopefully I'll find something. If I do I'll be sure to post the answer here. — szxnyc, Nov 18 '12 at 05:05

score 1 · Answer 1 · answered Nov 17 '12 at 17:36

Decoding it as UTF-8 should work:

itemText = itemText.decode('utf-8')

Eric

Normally, Python detects the terminal codec. Encoding blindly to UTF-8 is not going to help here. – Martijn Pieters Nov 17 '12 at 17:38
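
A minimal sketch of what that comment implies: instead of encoding blindly to UTF-8, encode to whatever codec the console reports and let anything it cannot represent degrade to '?'. The helper name `console_safe` is hypothetical, the fallback to ASCII when no encoding is reported is an assumption, and the code is written so that it should run on both Python 2 and Python 3.

```python
import sys


def console_safe(text, encoding=None):
    """Return text with characters the target codec cannot represent replaced by '?'."""
    # Assumption: fall back to ASCII when stdout reports no encoding (e.g. when piped).
    encoding = encoding or getattr(sys.stdout, 'encoding', None) or 'ascii'
    # Round-trip through the codec so the result is still a text string.
    return text.encode(encoding, 'replace').decode(encoding)


item_text = u"Rage Against the Machine's 1994\u20131995 Tour"
print(console_safe(item_text))
```

On a console whose codec includes the en dash, the text is printed unchanged; on one that does not, the dash comes out as '?' instead of raising UnicodeEncodeError.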

score 0 · Answer 2 · edited May 23 '17 at 12:11

Normally, you should try to preserve characters as Unicode or UTF-8. Avoid converting characters to your local codepage, as this results in loss of information.

However, if you must, here are a few things to do. Let's use your example character:

>>> s = u'\u2013'

If you want to print the string e.g. for debugging, you can use repr:

>>> print(repr(s))
u'\u2013'

In an interactive session, you can just type the variable name to achieve the same result:

>>> s
u'\u2013'

If you really want to convert the text to your local codepage, and it is OK that characters outside this codepage are converted to '?', you can use this:

>>> s.encode('latin-1', 'replace')
'?'
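
For comparison, here is a short sketch of the other standard error handlers that `encode` accepts besides 'replace'. It uses the 'ascii' codec from the question's error message rather than 'latin-1'; the variable names are only illustrative.

```python
text = u"1994\u20131995"

replaced = text.encode('ascii', 'replace')            # b'1994?1995'
ignored = text.encode('ascii', 'ignore')              # b'19941995'
xml_refs = text.encode('ascii', 'xmlcharrefreplace')  # b'1994&#8211;1995'
escaped = text.encode('ascii', 'backslashreplace')    # b'1994\\u20131995'

for result in (replaced, ignored, xml_refs, escaped):
    print(repr(result))
```

'xmlcharrefreplace' and 'backslashreplace' keep enough information to recover the original character later, which 'replace' and 'ignore' do not.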

If '?' is not good enough, you can use translate to convert selected characters into an equivalent character as in this answer.
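
The linked answer is not reproduced here, so the following is only a sketch of the general technique, not that answer's code: map a handful of common typographic characters to ASCII look-alikes with `translate`, then let 'replace' handle whatever is left. The names `PUNCTUATION_MAP` and `to_ascii` and the choice of characters are assumptions.

```python
# Hypothetical mapping from code points to ASCII look-alikes.
PUNCTUATION_MAP = {
    0x2013: u'-',  # EN DASH
    0x2014: u'-',  # EM DASH
    0x2018: u"'",  # LEFT SINGLE QUOTATION MARK
    0x2019: u"'",  # RIGHT SINGLE QUOTATION MARK
    0x201C: u'"',  # LEFT DOUBLE QUOTATION MARK
    0x201D: u'"',  # RIGHT DOUBLE QUOTATION MARK
}


def to_ascii(text):
    """Translate known characters, then replace anything still non-ASCII with '?'."""
    translated = text.translate(PUNCTUATION_MAP)
    return translated.encode('ascii', 'replace').decode('ascii')


print(to_ascii(u"Rage Against the Machine's 1994\u20131995 Tour"))
```

Characters missing from the table still end up as '?', so the table only needs to cover the cases you care about.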

score -2 · Answer 3 · answered Nov 17 '12 at 17:38

You may need to explicitly declare your encoding.

On the first line of your file (or after the hashbang, if there is one), add the following line:

# -*- coding: utf-8 -*-

This 'magic comment' forces Python to expect UTF-8 characters and should decode them successfully.

More details: http://www.python.org/dev/peps/pep-0263/

Cal McLean

The comment only applies to *reading the source code*, and has nothing to do with output encodings. – Martijn Pieters Nov 17 '12 at 17:39
This does not change the behavior of the source. – szxnyc Nov 17 '12 at 21:28
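
To illustrate the point made in the comments above, here is a small sketch showing that the coding declaration only controls how the bytes of a source file are decoded into string literals. It assumes Python 3, where `compile` accepts bytes and honours the declaration; the helper `load_s` and the two in-memory "files" are hypothetical.

```python
# Two byte-for-byte different "source files" that define the same one-character string.
latin1_source = b"# -*- coding: latin-1 -*-\ns = u'\xe9'\n"
utf8_source = b"# -*- coding: utf-8 -*-\ns = u'\xc3\xa9'\n"


def load_s(source_bytes):
    """Compile the given source bytes and return the value it assigns to `s`."""
    namespace = {}
    exec(compile(source_bytes, '<demo>', 'exec'), namespace)
    return namespace['s']


# Both sources yield the same text; the declaration only changed how the bytes were read.
print(load_s(latin1_source) == load_s(utf8_source))
```

Nothing in this affects which codec `print` uses when writing to the console, which is why adding the declaration does not make the UnicodeEncodeError go away.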
