2

I am trying to parse this document with Python and BeautifulSoup:

http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=rage_against_the_machine

The seventh Item down has this Text tag:

Rage Against the Machine's 1994–1995 Tour

When I try to print the text "Rage Against the Machine's 1994–1995 Tour", Python gives me this:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 31: ordinal not in range(128)

I can resolve it by simply replacing u'\u2013' with '-' like so:

itemText = itemText.replace(u'\u2013', '-')

However, what about every character that I have not coded for? I do not want to ignore them, nor do I want to list every possible find-and-replace.

Surely a library must exist that tries its very best to detect the encoding from a list of common known encodings (however likely it is to get it wrong).

someText = getTextWithUnknownEncoding(someLocation);
bestAsciiAttemptText = someLibrary.tryYourBestToConvertToAscii(someText)

Thank you

szxnyc
  • Are you on Windows, by any chance? – Martijn Pieters Nov 17 '12 at 17:35
  • *How* are you printing the text? To a terminal, to a file? Are you concatenating (str1 + str2) anywhere? – Martijn Pieters Nov 17 '12 at 17:38
  • Does [How can I display native accents to languages in console in windows?](http://stackoverflow.com/q/3473166) help? – Martijn Pieters Nov 17 '12 at 17:46
  • I am on Windows 7 and I am printing directly to the terminal – szxnyc Nov 17 '12 at 21:22
  • That link suggests to use .encode('utf-8') which does not work. It gives me the same exact error. – szxnyc Nov 17 '12 at 21:40
  • It also tells you that printing UTF-8 to the windows console is tricky. Did you follow the rest of the instructions? – Martijn Pieters Nov 17 '12 at 21:41
  • Possible duplicate of [How can I display native accents to languages in console in windows?](http://stackoverflow.com/q/3473166) – Martijn Pieters Nov 17 '12 at 21:51
  • Setting my console to use utf-8 via chcp 65001 or setting the font to Lucida Console does not change the behavior. I still get the error. – szxnyc Nov 17 '12 at 21:54
  • Not having Windows myself, I have no further hints for you on how to solve this; all I know is that getting UTF-8 to work in Windows consoles is needlessly difficult, and the linked answer is the only information I have for you. – Martijn Pieters Nov 17 '12 at 22:00
  • Thank you for all the suggestions Martijn. I'll keep trying and hopefully I'll find something. If I do I'll be sure to post the answer here. – szxnyc Nov 18 '12 at 05:05

3 Answers

1

If itemText is still a byte string rather than a unicode object, decoding it as UTF-8 should work:

itemText = itemText.decode('utf-8')
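
Whether decoding helps depends on the type of the value: bytes are decoded into text, and text is encoded only at the point of output. The following is a minimal sketch in Python 3 syntax (where bytes and str are distinct types), not the original Python 2 code. The name `raw` stands in for bytes received from the API, and `console_safe` is a hypothetical helper, not a library function.

```python
import sys

# UTF-8 bytes as they might arrive from the network (assumed sample data).
raw = b"Rage Against the Machine's 1994\xe2\x80\x931995 Tour"

# bytes -> Unicode text; the three bytes E2 80 93 become the en dash U+2013.
text = raw.decode("utf-8")


def console_safe(value, encoding=None):
    """Return value with characters the target encoding lacks replaced by '?'.

    Hypothetical helper: round-trips the text through the console encoding
    so that printing it cannot raise UnicodeEncodeError.
    """
    encoding = encoding or sys.stdout.encoding or "ascii"
    return value.encode(encoding, "replace").decode(encoding)


print(console_safe(text))
```

On a console whose encoding cannot represent the en dash, this prints a '?' in its place instead of raising an error.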
Eric
0

Normally, you should try to preserve characters as Unicode or UTF-8. Avoid converting characters to your local codepage, as this results in loss of information.

However, if you must, here are a few things to do. Let's use your example character:

>>> s = u'\u2013'

If you want to print the string e.g. for debugging, you can use repr:

>>> print(repr(s))
u'\u2013'

In an interactive session, you can just type the variable name to achieve the same result:

>>> s
u'\u2013'

If you really want to convert the text to your local codepage, and it is OK that characters outside this codepage are converted to '?', you can use this:

>>> s.encode('latin-1', 'replace')
'?'
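
The second argument to encode selects an error handler, and 'replace' is only one of the standard choices. As a rough sketch in Python 3 syntax (where encode returns bytes), here is how four of the built-in handlers treat the en dash from the question. The variable names are illustrative only.

```python
# Python 3 syntax: str is Unicode text, encode() returns bytes.
title = "Rage Against the Machine's 1994\u20131995 Tour"

# Each error handler decides what happens to characters the codec cannot represent.
replaced = title.encode("ascii", "replace")            # en dash becomes ?
ignored = title.encode("ascii", "ignore")              # en dash is dropped
escaped = title.encode("ascii", "backslashreplace")    # en dash becomes \u2013
xml_ref = title.encode("ascii", "xmlcharrefreplace")   # en dash becomes &#8211;

results = [
    ("replace", replaced),
    ("ignore", ignored),
    ("backslashreplace", escaped),
    ("xmlcharrefreplace", xml_ref),
]
for label, data in results:
    print(label, data.decode("ascii"))
```

The last two handlers keep enough information to recover the original character later, which 'replace' and 'ignore' do not.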

If '?' is not good enough, you can use translate to convert selected characters into an equivalent character as in this answer.
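
One possible shape for such a translation, sketched in Python 3 syntax: map a handful of common punctuation characters explicitly, then let Unicode NFKD normalization split accented letters into a base letter plus combining marks, and drop whatever is still outside ASCII. The `PUNCTUATION` table and the `to_ascii` helper are hypothetical names, and the table is deliberately small, so treat this as a starting point rather than a complete transliteration.

```python
import unicodedata

# Hypothetical, deliberately small table of punctuation look-alikes; extend as needed.
PUNCTUATION = {
    0x2013: "-",    # en dash
    0x2014: "-",    # em dash
    0x2018: "'",    # left single quotation mark
    0x2019: "'",    # right single quotation mark
    0x201C: '"',    # left double quotation mark
    0x201D: '"',    # right double quotation mark
    0x2026: "...",  # horizontal ellipsis
}


def to_ascii(text):
    """Best-effort ASCII approximation of text (hypothetical helper)."""
    # Step 1: replace known punctuation with ASCII equivalents.
    text = text.translate(PUNCTUATION)
    # Step 2: decompose accented letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 3: drop anything still outside ASCII (combining marks, unmapped characters).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Rage Against the Machine\u2019s 1994\u20131995 Tour"))
print(to_ascii("Caf\u00e9 \u2026"))
```

Characters with no decomposition and no entry in the table (for example CJK text) are silently dropped in step 3, so this loses information by design.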

oefe
-2

You may need to explicitly declare your encoding.

On the first line of your file (or after the hashbang, if there is one), add the following line:

# -*- coding: utf-8 -*-

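This 'magic comment' tells Python that the source file itself is encoded as UTF-8, so non-ASCII characters in its string literals are decoded correctly. It does not change how text is encoded when it is printed.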

More details: http://www.python.org/dev/peps/pep-0263/

Cal McLean