EDIT:

The following print shows my intended value.

(both sys.stdout.encoding and sys.stdin.encoding are 'UTF-8').

Why is the variable value different than its print value? I need to get the raw value into a variable.

>>username = 'Jo\xc3\xa3o'
>>username.decode('utf-8').encode('latin-1')
'Jo\xe3o'
>>print username.decode('utf-8').encode('latin-1')
João

Original question:

I'm having an issue querying a DB and decoding the values into Python.

I confirmed my DB NLS_LANG using

select property_value from database_properties where property_name='NLS_CHARACTERSET';

'''AL32UTF8 stores characters beyond U+FFFF as four bytes (exactly as Unicode defines 
UTF-8). Oracle’s “UTF8” stores these characters as a sequence of two UTF-16 surrogate
characters encoded using UTF-8 (or six bytes per character)'''

os.environ["NLS_LANG"] = ".AL32UTF8"

....
conn_data = str('%s/%s@%s') % (db_usr, db_pwd, db_sid)

sql = "select user_name from apex.users where user_id = '%s'" % userid

...

cursor.execute(sql)
ldap_username = cursor.fetchone()
...

where

print ldap_username
>>'Jo\xc3\xa3o'

I've tried both of the following, which return the same result:

ldap_username.decode('utf-8')
>>u'Jo\xe3o'
unicode(ldap_username, 'utf-8')
>>u'Jo\xe3o'

where

u'João'.encode('utf-8')
>>'Jo\xc3\xa3o'

How do I get the query's result back to the proper 'João'?

Joao Figueiredo

1 Answer

You already have the proper 'João', methinks. The difference between >>> 'Jo\xc3\xa3o' and >>> print 'Jo\xc3\xa3o' is that the former calls repr on the object, while the latter calls str (or probably unicode, in your case). It's just how the string is represented.

Some examples might make this more clear:

>>> print 'Jo\xc3\xa3o'.decode('utf-8')
João
>>> 'Jo\xc3\xa3o'.decode('utf-8')
u'Jo\xe3o'
>>> print repr('Jo\xc3\xa3o'.decode('utf-8'))
u'Jo\xe3o'

Notice how the second and third results are identical. The original ldap_username is currently a byte string (a Python 2 str holding UTF-8-encoded bytes), not a Unicode object. You can see this at the Python prompt: a byte string is displayed as 'byte string', while a Unicode object is displayed as u'Unicode string'; the key is the leading u.
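As a small sketch of the same point, the escape sequences belong only to the way the prompt displays the value, not to the value itself. The names `raw` and `text` are made up for illustration, with `raw` standing in for what the query returned; the snippet is written to run on Python 2.6+ as well as Python 3.

```python
from __future__ import print_function

# Hypothetical stand-in for the value fetched from the database:
# the UTF-8 encoded bytes of the name.
raw = b'Jo\xc3\xa3o'

# Decoding turns the bytes into Unicode text.
text = raw.decode('utf-8')

# The byte string has 5 bytes (the a-tilde takes two bytes in UTF-8),
# while the decoded text has 4 characters.
byte_count = len(raw)
char_count = len(text)

# u'Jo\xe3o' is just another spelling of the same four characters.
same_text = (text == u'Jo\xe3o')

print(byte_count, char_count, same_text)
```

So whether the prompt shows the escaped form or the accented character, the object in the variable is the same.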

So, as your ldap_username reads as 'Jo\xc3\xa3o', and is a byte string, the following applies:

>>> 'Jo\xc3\xa3o'.decode('utf-8')
u'Jo\xe3o'
>>> print 'Jo\xc3\xa3o'.decode('utf-8') # To Unicode...
João
>>> u'João'.encode('utf-8')             # ... back to UTF-8 bytes
'Jo\xc3\xa3o'

Summed up: you need to determine the type of the string (use type when unsure) and, based on that, either decode a byte string to Unicode or encode a Unicode object to bytes.
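One way to sketch that rule is a pair of small helpers that check the type before converting. The names `to_text` and `to_bytes` are hypothetical, not part of any library, and UTF-8 is only assumed as the default because that is what the database returns here. The code runs on Python 2.6+ (where `bytes` is an alias for `str`) and on Python 3.

```python
def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding it only if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def to_bytes(value, encoding='utf-8'):
    """Return value as a byte string, encoding it only if it is Unicode text."""
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)


# Bytes as they come back from the database (assumed to be UTF-8).
name = to_text(b'Jo\xc3\xa3o')

# Re-encode for a consumer that expects a particular byte encoding.
utf8_name = to_bytes(name)
latin1_name = to_bytes(name, 'latin-1')
```

Which encoding to pass to `to_bytes` depends on what the receiving side expects; the text itself is the same either way.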

jro
  • Thanks jro. Though I get exactly the same results on your 2nd and 3d examples, on your first I get: João, not João. How can I get that raw value 'João' as stored in the DB into a Python object ? – Joao Figueiredo Oct 24 '11 at 16:20
  • @JoaoFigueiredo: I updated the answer to address your additional question. – jro Oct 24 '11 at 17:41
  • I apologize if I haven't been clear. I think I grasped the basic principles of decoding and encoding (u'string' doesn't leave any doubt about its type). My issue keeps being how to pass the raw string to an external API. – Joao Figueiredo Oct 25 '11 at 09:16
  • As a temporary workaround I'm normalizing the strings, unicodedata.normalize('NFKD', ldap_username.decode('utf-8') ).encode('ascii', 'ignore') which normalizes 'João' to 'Joao', 'Lourenço' to 'Lourenco', etc – Joao Figueiredo Oct 25 '11 at 09:17
  • I'm not sure what the problem is. You ask why _"the variable value [is] different than its print value"_: this is simply the difference in representing the same string. `str` prints the string in a "pretty" format; `repr` prints the string in a format that can be used to reconstruct the object (using `eval`). Run the following commands, maybe that clears things up: `repr('Jo\xc3\xa3o')`, `eval('Jo\xc3\xa3o')`, `eval(repr('Jo\xc3\xa3o'))`. Finally, as to the issue of the external API, I'd need some error messages to say anything about it. – jro Oct 25 '11 at 09:31
  • Thanks for the patience. As you pointed out, I already have the correct representation of the original string. My confusion was thinking there was some way to obtain that original raw string from the unicode decoded string. – Joao Figueiredo Oct 25 '11 at 14:09
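The normalization workaround mentioned in the comments can be sketched roughly as follows. The function name `strip_accents` is made up for illustration, and the input is assumed to be UTF-8 bytes as returned by the query. The snippet runs on Python 2.6+ and Python 3.

```python
import unicodedata


def strip_accents(raw, encoding='utf-8'):
    """Decode raw bytes, then drop everything that is not plain ASCII."""
    text = raw.decode(encoding)
    # NFKD splits an accented letter into the base letter plus a combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with 'ignore' silently discards the combining marks.
    return decomposed.encode('ascii', 'ignore')


joao = strip_accents(b'Jo\xc3\xa3o')            # 'Joao'
lourenco = strip_accents(b'Louren\xc3\xa7o')    # 'Lourenco'
```

This is lossy: characters that have no decomposition (for example 'ø') are dropped entirely rather than replaced, so it only suits cases where an approximate ASCII name is acceptable.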