I have already tried all the previous answers and solutions.

I am trying to use this value, which gave me an encoding-related error.

ar = [u'http://dbpedia.org/resource/Anne_Hathaway', u'http://dbpedia.org/resource/Jodie_Bain', u'http://dbpedia.org/resource/Wendy_Divine', u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', u'http://dbpedia.org/resource/Baaba_Maal']

So I tried,

d = [x.decode('utf-8') for x in ar]

which gives:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 31: ordinal not in range(128)

I also tried

d = [x.encode('utf-8') for x in ar]

which removes the error but changes the original content.

The original value was u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno', which became 'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno' after encode.

What is the correct way to deal with this scenario?

Edit

The error occurs when I pass these links to

req = urllib2.Request()
nlper
  • possible duplicate of [UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 20: ordinal not in range(128)](http://stackoverflow.com/questions/9942594/unicodeencodeerror-ascii-codec-cant-encode-character-u-xa0-in-position-20) – sirfz Jun 02 '15 at 09:26
  • What do you want to do with the data? ASCII encoding does not support those characters. That's why we have encodings such as UTF-8. I'd highly advise skipping ASCII if you plan on using this application out in the wild. – Sid Shukla Jun 02 '15 at 09:27
  • If you already have unicode strings, then you don't want to `decode()` them into unicode strings. :-) It's likely you want to interact with something that requires non-unicode strings, which means putting them in an acceptable encoding via `encode()`. These days, that's usually UTF-8, but it really depends on what you're trying to do and the service you're interacting with. – John Szakmeister Jun 02 '15 at 09:28
  • @SiddharthShukla: I store these links in my Solr database, and later match them against `links` given by users. While dealing with the user's input link value, I get this issue. I don't want to change the way the link looks. – nlper Jun 02 '15 at 09:29
  • @nlper In that case you would just want to go with the most common encoding: UTF-8. You could also just store it the way it is. – Sid Shukla Jun 02 '15 at 09:37

3 Answers

2

The second version of your string is the correct utf-8 representation of your original unicode string. If you want to have a meaningful comparison, you have to use the same representation for both the stored string and the user input string. The sane thing to do here is to always use Unicode string internally (in your code), and make sure both your user inputs and stored strings are correctly decoded to unicode from their respective encodings at your system's boundaries (storage subsystem and user inputs subsystem).
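A minimal sketch of that boundary rule, assuming both the stored value and the user input may arrive as UTF-8 bytes. The helper name `to_text` and the sample values are hypothetical and not part of the question's code; the `b''` and `u''` prefixes keep the snippet valid on both Python 2.7 and Python 3:

```python
def to_text(value, encoding="utf-8"):
    """Return value as a Unicode string, decoding it only if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Hypothetical boundary values: one arrives as UTF-8 bytes, one as Unicode.
stored = b'http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno'
user_input = u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'

# After normalising both sides to Unicode, the comparison is meaningful.
links_match = to_text(stored) == to_text(user_input)
```

Decoding once at the edge means the rest of the code never has to guess which representation a link is in.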

Also you seem to be a bit confused about unicode and encodings, so reading this and this might help.

bruno desthuilliers
2

Unicode strings in Python are "raw" Unicode, so make sure to .encode() and .decode() them as appropriate. UTF-8 is the most widely used encoding and a sensible default. To percent-encode a URL, use the quote function, which urllib2 re-exports from urllib:

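from urllib2 import quote
unicode_string = u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'
escaped_string = quote(unicode_string.encode('utf-8'), safe=':/')  # keep ':' and '/' so the scheme and path are not escaped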

To decode, use unquote:

from urllib2 import unquote
src = "http://dbpedia.org/resource/Jos\xc3\xa9_El\xc3\xadas_Moreno"
unicode_string = unquote(src).decode('utf-8')
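For readers on Python 3, where `urllib2` no longer exists, the same round trip can be sketched with `urllib.parse`. This assumes a Python 3 environment, which the original answer does not; `quote` there encodes to UTF-8 by default, and `safe=':/'` is passed so the scheme and path separators are left alone:

```python
from urllib.parse import quote, unquote

# The question's Unicode link (str is already Unicode on Python 3).
url = 'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno'

# Percent-encode the non-ASCII characters as UTF-8 bytes.
escaped = quote(url, safe=':/')
# escaped == 'http://dbpedia.org/resource/Jos%C3%A9_El%C3%ADas_Moreno'

# unquote reverses the escaping and decodes the bytes as UTF-8.
restored = unquote(escaped)
```

The escaped form is pure ASCII, so it can be handed to an HTTP client without triggering an implicit ASCII encode.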

Also, if you're more interested in Unicode and UTF-8 work, check out Unicode HOWTO and

Sid Shukla
0

In your Unicode list, u'http://dbpedia.org/resource/Jos\xe9_El\xedas_Moreno' is an ASCII safe way to represent a Unicode string. When encoded in a form that supports the full Western European character set, such as UTF-8, it's: http://dbpedia.org/resource/José_Elías_Moreno

Your .encode("UTF-8") is correct and would have looked ok in a UTF-8 editor or browser. What you saw after the encode was an ASCII safe representation of UTF-8.

For example, your trouble chars were é and í.

é = 00E9 Unicode = C3A9 UTF-8
í = 00ED Unicode = C3AD UTF-8

In short, your .encode() method is correct and should be used for writing to files or to a browser.
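The code point and UTF-8 values listed above can be checked with a short sketch. The helper name `utf8_hex` is hypothetical; the snippet uses only the standard library and should behave the same on Python 2.7 and Python 3:

```python
import binascii

def utf8_hex(ch):
    """Return (code point, UTF-8 bytes) for one character, both as uppercase hex."""
    code_point = '%04X' % ord(ch)
    encoded = binascii.hexlify(ch.encode('utf-8')).decode('ascii').upper()
    return code_point, encoded

e_acute = utf8_hex(u'\xe9')  # ('00E9', 'C3A9')
i_acute = utf8_hex(u'\xed')  # ('00ED', 'C3AD')
```

The two bytes per character are what show up as `\xc3\xa9` and `\xc3\xad` in the repr of the encoded string.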

Alastair McCormack