I am trying to parse this document with Python and BeautifulSoup:
http://en.wikipedia.org/w/api.php?format=xml&action=opensearch&search=rage_against_the_machine
The seventh Item down has this Text tag:
Rage Against the Machine's 1994–1995 Tour
When I try to print the text "Rage Against the Machine's 1994–1995 Tour", Python gives me this:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 31: ordinal not in range(128)
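The failure can be reproduced without the network or BeautifulSoup. This is only a minimal sketch: it assumes the text is already a unicode string and that printing ends up encoding it as ASCII (which is what seems to happen here). The name item_text is just for illustration.

```python
# Same title as in the API response; \u2013 is the en dash.
item_text = u"Rage Against the Machine's 1994\u20131995 Tour"

try:
    # Encoding explicitly to ASCII triggers the same kind of error.
    item_text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

# error.start is the index of the first character that cannot be encoded.
print(error.start if error else "encoded without error")
```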
I can resolve it by simply replacing u'\u2013' with '-' like so:
itemText = itemText.replace(u'\u2013', '-')
However, what about every character that I have not coded for? I do not want to ignore them, nor do I want to list every possible find-and-replace.
Surely a library must exist that tries its very best to detect the encoding from a list of common known encodings (however likely it is to get it wrong).
someText = getTextWithUnknownEncoding(someLocation)
bestAsciiAttemptText = someLibrary.tryYourBestToConvertToAscii(someText)
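To make the desired behaviour concrete, here is a rough sketch using only the standard library. The helper name best_ascii_attempt and the FALLBACKS table are hypothetical, not part of any existing library. NFKD normalization takes care of accented letters, but punctuation such as the en dash still needs a hand-written table, which is exactly the list I would like to avoid maintaining.

```python
import unicodedata

# Hypothetical fallback table for punctuation that NFKD does not decompose.
FALLBACKS = {
    u"\u2013": u"-",   # en dash
    u"\u2018": u"'",   # left single quotation mark
    u"\u2019": u"'",   # right single quotation mark
}

def best_ascii_attempt(text):
    """Return an ASCII approximation of a unicode string (hypothetical helper)."""
    # Apply the hand-written substitutions first.
    for char, replacement in FALLBACKS.items():
        text = text.replace(char, replacement)
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks so only the base letters remain.
    stripped = u"".join(c for c in decomposed if not unicodedata.combining(c))
    # Anything that is still not ASCII is replaced with '?'.
    return stripped.encode("ascii", "replace").decode("ascii")

print(best_ascii_attempt(u"Rage Against the Machine's 1994\u20131995 Tour"))
```

Characters with no entry in the table and no decomposition come out as '?', so nothing raises, but information is still lost.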
Thank you