13

What is the best way to decode an encoded string that looks like: u'u\xf1somestring' ?

Background: I have a list that contains random values (strings and integers). I'm trying to convert every item in the list to a string and then process each of them.

It turns out some of the items have the form u'u\xf1somestring'. When I try converting one of them to a string, I get the error: UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 1: ordinal not in range(128)

I have tried

item = u'u\xf1somestring'
decoded_value = item.decode('utf-8', 'ignore')

However, I keep getting the same error.

I have read up about unicode characters and tried a number of suggestions from SO but none have worked so far. Am I missing something here?

mfalade
  • If it's a Unicode string, it's already decoded. – RemcoGerlich Jan 29 '16 at 11:31
  • You may find this article helpful: [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html), which was written by SO veteran Ned Batchelder. – PM 2Ring Jan 29 '16 at 11:32
  • I assume you're using Python 2. You should **always** mention the Python version with Unicode questions (preferably with the appropriate tag) because Python 2 & Python 3 handle Unicode rather differently. – PM 2Ring Jan 29 '16 at 11:33
  • FWIW, `s = u'u\xf1somestring'.encode('utf-8');print s, repr(s)` prints `uñsomestring 'u\xc3\xb1somestring'` – PM 2Ring Jan 29 '16 at 11:37

2 Answers

16

You need to call the `encode` method, not the `decode` method, because `item` is already decoded (it is a Unicode object).

Like this:

decoded_value = item.encode('utf-8')
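
To make the direction of each call concrete, here is a minimal sketch using the string from the question. The names `encoded_value`, `ascii_failed`, and `round_tripped` are illustrative and not from the original post. It also shows why `decode` on a Unicode object fails in Python 2: the interpreter first encodes the text implicitly with the ASCII codec, and that implicit step is what raises `UnicodeEncodeError` (the `'ignore'` argument only applies to the later decode step).

```python
item = u'u\xf1somestring'   # text (a Unicode object), i.e. already decoded

# Encoding turns text into bytes. UTF-8 can represent every code point.
encoded_value = item.encode('utf-8')

# Python 2's str(item) and item.decode(...) both begin with an implicit
# ASCII encode; reproducing that step explicitly shows where the error
# in the question comes from.
try:
    item.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Decoding the bytes with the same codec gives the original text back.
round_tripped = encoded_value.decode('utf-8')
```

The resulting bytes match what the comment on the question reports: `u\xc3\xb1somestring`, where `\xc3\xb1` is the two-byte UTF-8 form of `ñ`.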
Sameer Mirji
  • You *decode* to Unicode, *encode* to byte strings. – Mark Tolonen Jan 29 '16 at 17:01
  • @MarkTolonen: So what part of my answer did you find wrong here? I've specifically used code blocks to indicate I was talking about the method names here. – Sameer Mirji Jan 30 '16 at 04:24
  • The string is already decoded if it is a Unicode string. `item.encode('utf-8')` makes an `encoded_value`. You (and the OP) have the terminology confused. – Mark Tolonen Jan 30 '16 at 07:27
3

That string already is decoded (it's a Unicode object). You need to encode it if you want to store it in a file (or send it to a dumb terminal etc.).

Generally, when working with Unicode in Python 2, you should decode all your strings early in the workflow (which you already seem to have done; many libraries that handle internet traffic will already do that for you). Then do all your work on Unicode objects, and only at the very end, when writing them back, encode them to whatever encoding you're using.
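
The workflow above can be sketched for the mixed list described in the question. This is only an illustration under stated assumptions: `to_text` is a hypothetical helper, the sample list and the `upper()` "processing" step are made up, byte strings are assumed to be UTF-8, and the output file is a throwaway temp file. The code is written to run on both Python 2 and Python 3.

```python
import io
import os
import tempfile

text_type = type(u'')  # unicode on Python 2, str on Python 3


def to_text(value, encoding='utf-8'):
    """Return value as a text (Unicode) string. Hypothetical helper."""
    if isinstance(value, text_type):
        return value                     # already decoded, leave it alone
    if isinstance(value, bytes):
        return value.decode(encoding)    # byte string: decode once, early
    return text_type(value)              # integers and other objects


items = [u'u\xf1somestring', b'plain', 42]   # made-up sample data
texts = [to_text(i) for i in items]          # everything is Unicode from here
processed = [t.upper() for t in texts]       # stand-in for the real processing

# Encode only at the boundary: io.open encodes on write when given an encoding.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(processed))
```

Checking the type before converting avoids calling `str()` on a Unicode object, which is the call that triggers the implicit ASCII encode in Python 2.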

Tim Pietzcker