How to convert a unicode character representation from string to unicode in python?

Question

Ok I've found a lot of threads about how to convert a string from something like "\xe3" to "ã" but how the hell am I supposed to do it the other way around?

My concrete problem: I am using an API and everything works great, except that I provide some strings which then end up in a JSON object. The result is sorted by the names (strings) I provided, but they are returned as their unicode representation, since JSON APIs always work in pure strings. So all I need is a way to get from "ã" to "\xe3", but I can't for the love of god get it to work.

Every type of encoding or decoding I try either defaults back to a normal string, a string without that character, a string with a plain A, or a unicode error saying that ascii can't decode it. (<- this was due to a horrible shell setup. Yay for old me.)

All I want is the plain encoded string! (yea no not at all past me. All you want is the unicode representation of a character as string)

PS: All in python if that wasn't obvious from the title already.

Edit: Even though this is quite old I wanted to update this to not completely embarrass myself in the future.

The issue was an API which returned unicode representations of characters as strings in its response. All I wanted to do was check whether they were the same, but I had major issues getting python to interpret the string as unicode, especially since those characters were just a few inside a longer text, partially with backslashes.

This did help but I just stumbled across this horribly written question and just couldn't leave it like that.

From your problem description it is obvious that you did not entirely understand the concepts of byte strings vs sequences of unicode code points. I suggest having a deep read about this topic, it pays off. Then you need to come back and ask your question more precisely, and also show a piece of code. — Dr. Jan-Philip Gehrcke, Nov 20 '14 at 12:16
I would kinda agree that I probably did not grasp everything but I don't think I am missing that much. My main issue is that I get everything in the "readable" version. But I can't seem to understand how I get the "plain" version. Basically I want to get from "ã" to "\xe3". Not more. Not less. I tried to read up but I'm at a point where I just gave up completely. — Erasio, Nov 20 '14 at 12:23
possible duplicate of [Python - Unicode to ASCII conversion](http://stackoverflow.com/questions/19527279/python-unicode-to-ascii-conversion) — tripleee, Nov 20 '14 at 12:43
Please read http://www.joelonsoftware.com/articles/Unicode.html — jsbueno, Nov 20 '14 at 12:47

score 2 · Accepted Answer · answered Nov 20 '14 at 12:52

"\xe3" in python is a string literal that represents a single byte with value 227:

>>> print len("\xe3")
1
>>> print ord("\xe3")
227

This single byte represents the 'ã' character in the latin-1 encoding (http://en.wikipedia.org/wiki/ISO/IEC_8859-1).

"ã" in python is a string literal consisting of two bytes: 0xC3, 0xA3 (195, 163):

>>> print len("ã")
2
>>> print ord("ã"[0])
195
>>> print ord("ã"[1])
163

This byte sequence is the UTF-8 encoding of the character "ã".

So, to go from "ã" in python to "\xe3", you first need to decode the utf-8 byte sequence into a python unicode string:

>>> "ã".decode("utf-8")
u'\xe3'

Now, you can take that unicode string and encode it however you like (e.g. into latin-1):

>>> "ã".decode("utf-8").encode("latin-1")
'\xe3'
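
The sessions above are Python 2, where a plain "..." literal is a byte string. As a rough sketch of the same steps in Python 3 (an assumption on my side, the answer itself does not cover Python 3), where str is already a sequence of Unicode code points and only .encode() produces bytes. The last part uses the standard unicode_escape codec to get the textual form \xe3 as an ordinary string, which seems to be what the edited question is really after:

```python
# Python 3 sketch: str holds code points, bytes holds encoded data.
# The character is written as an escape so the source file encoding does not matter.
text = "\u00e3"  # the character ã

utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA3
latin1_bytes = text.encode("latin-1")  # one byte: 0xE3

# Mirrors the Python 2 round trip: UTF-8 bytes -> text -> latin-1 bytes.
converted = utf8_bytes.decode("utf-8").encode("latin-1")

# Textual escape form: the four ASCII characters backslash, x, e, 3.
escaped = text.encode("unicode_escape").decode("ascii")

# And back from the escape text to the real character.
restored = escaped.encode("ascii").decode("unicode_escape")

print(utf8_bytes, latin1_bytes, converted, escaped, restored)
```

Note that unicode_escape only yields the \xNN form for code points below U+0100; higher code points come out as \uNNNN or \UNNNNNNNN.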

score 0 · Answer 2 · answered Nov 20 '14 at 12:53

Please read http://www.joelonsoftware.com/articles/Unicode.html . You should realize there is no such thing as "a plain encoded string". There is only "an encoded string in a given text encoding". So you really need to better understand the concepts of Unicode.

Among other things, this is plain wrong: "The result is sorted after the names (strings) I provided however they are returned in encoded form." JSON uses Unicode, so you get the string in a decoded form.
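
To illustrate the point about JSON, here is a small Python 3 sketch using the standard json module. The response body in raw is made up for the example and is not the actual API output from the question. Parsing it turns the \u00e3 escape into the real character, so comparing against "ã" just works:

```python
import json

# Hypothetical response body: the escape \u00e3 arrives as plain ASCII text.
raw = '{"name": "\\u00e3"}'

parsed = json.loads(raw)
name = parsed["name"]  # a one-character str holding U+00E3

# Comparison happens on decoded text, no manual conversion needed.
matches = (name == "\u00e3")

# Going the other way: json.dumps escapes non-ASCII characters by default.
escaped_json = json.dumps(name)
# With ensure_ascii=False the character is written out as-is.
literal_json = json.dumps(name, ensure_ascii=False)

print(matches, len(name), escaped_json, literal_json)
```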

jar · Answer 3 · 2014-11-20T21:28:44.340

Since I assume you are, perhaps unknowingly, working with UTF-8, you should be aware that \xe3 is the Unicode code point for the character ã. Not to be mistaken for the actual bytes that UTF-8 uses to reference that code point:

http://hexutf8.com/?q=U+e3

I.e. UTF-8 maps the byte sequence c3 a3 to the code point U+e3 which represents the character ã.

UTF-16 maps a different byte sequence, 00 e3 (in big-endian byte order), to that exact same code point. (Note how much simpler, but less space efficient the UTF-16 encoding is...)
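
A quick way to check these byte sequences yourself, sketched in Python 3 (my choice of version, the answer does not specify one). The codec name utf-16-be is used so that no byte order mark is prepended:

```python
# One code point, two different byte sequences depending on the encoding.
ch = "\u00e3"  # the character ã

code_point = ord(ch)                  # 227, i.e. 0xE3

utf8_hex = ch.encode("utf-8").hex()        # UTF-8 bytes as a hex string
utf16_hex = ch.encode("utf-16-be").hex()   # UTF-16 big-endian bytes, no BOM

# Both byte sequences decode back to the same code point.
same = ch.encode("utf-8").decode("utf-8") == ch.encode("utf-16-be").decode("utf-16-be")

print(hex(code_point), utf8_hex, utf16_hex, same)
```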
