0

Ok, I've found a lot of threads about how to convert a string from something like "\xe3" to "ã", but how the hell am I supposed to do it the other way around?

My concrete problem: I am using an API and everything works great, except that I provide some strings which then end up in a JSON object. The result is keyed by the names (strings) I provided, but they come back as their unicode representation, since JSON APIs always work in pure strings. So all I need is a way to get from "ã" to "\xe3", but I can't for the love of god get it to work.

Every type of encoding or decoding I try either falls back to a normal string, a string without that character, a string with a plain A, or a Unicode error saying that ascii can't decode it. (<- this was due to a horrible shell setup. Yay for old me.)

All I want is the plain encoded string! (yea no not at all past me. All you want is the unicode representation of a character as string)

PS: All in python if that wasn't obvious from the title already.

Edit: Even though this is quite old I wanted to update this to not completely embarrass myself in the future.

The issue was an API which returned unicode representations of characters as strings in its response. All I wanted to do was check whether they were the same, but I had major issues getting python to interpret the string as unicode, especially since those characters were scattered inside a longer text that also contained other backslashes.

This did help but I just stumbled across this horribly written question and just couldn't leave it like that.
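For anyone landing here with the same problem, a minimal sketch of the comparison described in the edit, assuming Python 3 and assuming the API text is pure ASCII with literal backslash escapes in it (the sample string is made up for illustration):

```python
# Hypothetical API payload: the escapes are literal characters
# (backslash, x, e, 3), not the character they stand for.
raw = r"S\xe3o Paulo is written with \xe3"

# unicode_escape works on bytes, so go through ASCII first.
# This only works when raw contains no non-ASCII characters.
decoded = raw.encode("ascii").decode("unicode_escape")

# The string we want to compare against, with real code points.
expected = "S\u00e3o Paulo is written with \u00e3"

print(decoded == expected)
```

If the text can also contain backslashes that are not escapes, this blanket decode would misread them, so it is only a starting point.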

Erasio
  • From your problem description it is obvious that you did not entirely understand the concepts of byte strings vs sequences of unicode code points. I suggest having a deep read about this topic, it pays off. Then you need to come back and ask your question more precisely, and also show a piece of code. – Dr. Jan-Philip Gehrcke Nov 20 '14 at 12:16
  • I would kinda agree that I probably did not grasp everything but I don't think I am missing that much. My main issue is that I get everything in the "readable" version. But I can't seem to understand how I get the "plain" version. Basically I want to get from "ã" to "\xe3". Not more. Not less. I tried to read up but I'm at a point where I just gave up completely. – Erasio Nov 20 '14 at 12:23
  • possible duplicate of [Python - Unicode to ASCII conversion](http://stackoverflow.com/questions/19527279/python-unicode-to-ascii-conversion) – tripleee Nov 20 '14 at 12:43
  • Please read http://www.joelonsoftware.com/articles/Unicode.html – jsbueno Nov 20 '14 at 12:47

3 Answers

2

"\xe3" in python is a string literal that represents a single byte with value 227:

>>> print len("\xe3")
1
>>> print ord("\xe3")
227

This single byte represents the 'ã' character in the latin-1 encoding (http://en.wikipedia.org/wiki/ISO/IEC_8859-1).

"ã" in python is a string literal consisting of two bytes: 0xC3, 0xA3 (195, 163):

>>> print len("ã")
2
>>> print ord("ã"[0])
195
>>> print ord("ã"[1])
163

This byte sequence is the UTF-8 encoding of the character "ã".

So, to go from "ã" in python to "\xe3", you first need to decode the utf-8 byte sequence into a python unicode string:

>>> "ã".decode("utf-8")
u'\xe3'

Now, you can take that unicode string and encode it however you like (e.g. into latin-1):

>>> "ã".decode("utf-8").encode("latin-1")
'\xe3'
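
The session above is Python 2. As a hedged sketch of the same idea in Python 3 (an assumption, since the question does not state a version), where str already holds code points and bytes is a separate type:

```python
# The character ã, written as an escape to avoid source-encoding issues.
text = "\u00e3"

utf8_bytes = text.encode("utf-8")      # two bytes: 0xC3 0xA3
latin1_bytes = text.encode("latin-1")  # one byte: 0xE3

# The textual escape form: the four characters backslash, x, e, 3.
escaped = text.encode("unicode_escape").decode("ascii")

print(list(utf8_bytes))
print(list(latin1_bytes))
print(escaped)
```

The last step is probably what the question is after: a str holding the escape sequence itself rather than the character.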
Tom Dalton
0

Please read http://www.joelonsoftware.com/articles/Unicode.html . You should realize there is no such thing as "a plain encoded string". There is only "a string encoded in a given text encoding". So you really need to get a better understanding of the concepts of Unicode.

Among other things, this is plain wrong: "The result is sorted after the names (strings) I provided however they are returned in encoded form." JSON uses Unicode, so you get the string in a decoded form.
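
A small sketch of that point, assuming Python 3 and the standard json module (the key name is made up): the \u00e3 escape exists only in the serialized JSON text, and parsing gives back the actual character.

```python
import json

# Hypothetical payload with one non-ASCII character (ã, U+00E3).
payload = {"name": "\u00e3"}

# ensure_ascii=True is the default, so non-ASCII characters are
# written as \uXXXX escapes in the JSON text.
wire_text = json.dumps(payload)

# Parsing turns the escape back into the real code point.
parsed = json.loads(wire_text)

print(wire_text)
print(parsed["name"] == "\u00e3")
```

Passing ensure_ascii=False to json.dumps would keep the character itself in the output text instead of the escape.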

jsbueno
0

Since I assume you are, perhaps unknowingly, working with UTF-8, you should be aware that \xe3 is the escape for the Unicode code point U+00E3, which is the character ã. It is not to be mistaken for the actual bytes that UTF-8 uses to encode that code point:

http://hexutf8.com/?q=U+e3

I.e. UTF-8 maps the byte sequence c3 a3 to the code point U+e3 which represents the character ã.

UTF-16 maps a different byte sequence, 00 e3 (in big-endian byte order), to that exact same code point. (Note how much simpler, but less space efficient, the UTF-16 encoding is for a character like this...)
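
To see the difference between the code point and its encoded bytes, here is a short sketch, assuming Python 3:

```python
ch = "\u00e3"  # the character ã

code_point = ord(ch)               # 227, i.e. 0xE3
utf8 = ch.encode("utf-8")          # bytes c3 a3
utf16_be = ch.encode("utf-16-be")  # bytes 00 e3
utf16_le = ch.encode("utf-16-le")  # bytes e3 00

print(hex(code_point), utf8.hex(), utf16_be.hex(), utf16_le.hex())
```

The code point stays the same in every case; only the byte sequence depends on the encoding and, for UTF-16, on the byte order.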

jar