I was wondering how to determine the encoding of a unicode.

I know I've read about this somewhere; I just don't remember whether it was possible or not, but I want to believe there was a way.

Let's say I have a unicode that was decoded as latin-1. I'd like to dynamically encode it with the same encoding that was used when decoding it...

Frankly I'd like to turn it into a utf-8 unicode without messing up the characters before working with it.

I.e:

latin1_unicode = 'åäö'.decode('latin-1')
utf8_unicode = latin1_unicode.encode('latin-1').decode('utf-8')
JayDL
  • Is this Python 3 perchance? Otherwise, what do you mean by 'a unicode'? – cha0site Jan 26 '12 at 10:22
  • Are you asking how to guess the encoding of an array of bytes? – Raymond Hettinger Jan 26 '12 at 10:24
  • I mean unicode instances. And no, it's not Python 3. I'm asking how to determine the encoding of the unicode string, either through its code points and how it chooses to represent the characters through a given encoding, or any other way. Whichever way is possible. – JayDL Jan 26 '12 at 10:37
  • It doesn't make any sense at all to ask for "the encoding of a unicode [string]". By definition, a unicode string is not encoded. – Daniel Roseman Jan 26 '12 at 10:59
  • No but you decode it using an encoding. If you want to encode it again, you'd need that very same encoding if you want the characters to be represented correctly. If I decode a string with latin1 and then encode it with utf8 the string won't look the same. The encoding for the unicode string is used to map the code points to various letters. – JayDL Jan 26 '12 at 11:06
  • @JayLev, that's complete nonsense. Once you've decoded a string to unicode, it's unicode. It has no "memory" of what it used to be, and doesn't care what you encode it as afterwards. If you want a utf-8 string, you can encode it as one. There's no "messing it up". – Daniel Roseman Jan 26 '12 at 11:31
  • @DanielRoseman, see Alien Life Forms answer. – JayDL Jan 26 '12 at 11:39
  • Yes. You'll note it doesn't have anything to do with how a string was originally decoded into unicode, but rather how if you decode a string using the wrong encoding, it'll end up as garbage. But your question isn't about that, it's about strings that you already have as unicode. – Daniel Roseman Jan 26 '12 at 13:02
  • I've made myself misunderstood then, my bad. I actually get a decoded unicode, which was decoded using the wrong encoding. So I formulated a question with the example of me already having that unicode string in an attempt to find out how I could fix it. – JayDL Jan 26 '12 at 13:12

1 Answer

If, in "determine the encoding of a unicode", "unicode" is the Python data type, then you cannot do it, as "encoding" refers to the original byte patterns that represented the string when it was input (say, read from a file, a database, you name it). By the time it becomes a Python 'unicode' type (an internal representation), the string has either been decoded behind the scenes or has thrown a decoding exception because a byte sequence did not jibe with the system encoding.

Shadyabhi's answer refers to the (common) case in which you are reading bytes from a file (which you could very well be stuffing in a string, not a Python unicode string) and need to guess in what encoding they were saved. Strictly speaking, you cannot have a "latin1 unicode Python string": a unicode Python string has no encoding. Encoding may be defined as the process that translates a character to a byte pattern, and decoding as the inverse process; a decoded string therefore has no encoding, though it can be encoded in several ways for storage or external representation purposes.
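
To make that concrete, here is a minimal sketch. It is written for Python 3 (an assumption; the question is about Python 2), where `bytes` and `str` play the roles of Python 2's `str` and `unicode`. The variable names are illustrative only.

```python
# Python 3 sketch: bytes carry an encoding, str (Python 2's `unicode`) does not.
text = "\u00e5\u00e4\u00f6"  # the three characters from the question

# The same text can be stored as different byte patterns.
latin1_bytes = text.encode("latin-1")  # one byte per character
utf8_bytes = text.encode("utf-8")      # two bytes per character

# Decode each byte string with the encoding it was really written in.
from_latin1 = latin1_bytes.decode("latin-1")
from_utf8 = utf8_bytes.decode("utf-8")

# The bytes differ, the decoded strings are identical: nothing in the
# decoded string records which encoding the bytes came from.
print(latin1_bytes != utf8_bytes, from_latin1 == from_utf8)
```

Once both values are decoded correctly they compare equal, which is why asking for "the encoding" of the decoded string has no answer.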

For instance on my machine:

In [35]: sys.stdin.encoding
Out[35]: 'UTF-8'

In [36]: a='è'.decode('UTF-8')

In [37]: b='è'.decode('latin-1')

In [38]: a
Out[38]: u'\xe8'

In [39]: b
Out[39]: u'\xc3\xa8'

In [41]: sys.stdout.encoding
Out[41]: 'UTF-8'

In [42]: print b #it's garbage
Ã¨

In [43]: print a #it's OK
è

Which means that, in your example, latin1_unicode will contain garbage if the bytes of the literal were actually encoded as UTF-8, or UTF-16, or anything different from latin1 (here that is decided by the terminal encoding).
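
If you already hold such a garbled string and happen to know (or can guess) which wrong encoding was used, the damage can usually be reversed by encoding with the wrong codec and decoding with the right one. The sketch below assumes Python 3, and `repair_mojibake` is a hypothetical helper name, not a library function. It only works for the specific wrong/right pair you pass in.

```python
def repair_mojibake(text, wrong="latin-1", right="utf-8"):
    """Undo a decode done with `wrong` when the bytes were really `right`.

    Hypothetical helper: returns the input unchanged if the round trip fails.
    """
    try:
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The string was probably not garbled in this particular way.
        return text


# Reproduce the mistake: UTF-8 bytes decoded as latin-1.
garbled = "\u00e8".encode("utf-8").decode("latin-1")
fixed = repair_mojibake(garbled)
print(repr(garbled), repr(fixed))
```

The string itself cannot tell you which pair to use; that knowledge has to come from the data source.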

So what you (may) want to do is:

  1. Ascertain the encoding of your data source - perhaps using one of Shadyabhi's methods
  2. Decode the data according to (1), save it in python unicode strings
  3. Encode it using the original encoding (if that's what serves your needs) or some other encoding of your choosing.
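
The three steps above can be sketched as follows, assuming Python 3. The detection step here is a plain trial decode over a list of candidates, which is a stand-in of my own and not one of the methods from the other answer; `decode_with_candidates` is a hypothetical name. Note that latin-1 accepts any byte sequence, so it only makes sense as the last fallback.

```python
def decode_with_candidates(raw, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes `raw`.

    Hypothetical helper: a successful decode is a guess, not proof.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings fit")


# Step 1 and 2: guess the source encoding and decode to a text string.
raw = b"\xe5\xe4\xf6"  # latin-1 bytes; not valid UTF-8
text, source_encoding = decode_with_candidates(raw)

# Step 3: encode with whatever encoding the destination needs.
utf8_out = text.encode("utf-8")
print(source_encoding, utf8_out)
```

Keeping everything as text between step 2 and step 3 means the choice of output encoding is independent of where the bytes came from.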
Alien Life Form
  • I'm aware of all this. I was asking how to find the "Decoding Method". I know the reason why the latin1 decoded unicode looks like garbage: it's because the default encoding is utf-8. But I get unicode strings from different sources which use different encodings. Which is why I have to figure out how they were decoded, so that I don't have a bunch of strings looking like garbage. I could change the encoding on the different sources to make it work, but I want it to work dynamically. – JayDL Jan 26 '12 at 11:20
  • Once you have a Unicode string, the information about the source encoding **no longer exists**. I don't understand what you mean about "looking like garbage". The Unicode string is a Unicode string, and correctly represents every character. Why don't you take several steps back, and walk us through **exactly what you want to do**, step by step? – Karl Knechtel Jan 26 '12 at 12:25
  • I won't hide the fact that I just recently learned about how to work with unicode and different encodings. So I might be a little out there. The thing is, I have two databases, and for some reason they have different collations for text fields. So when I define the field type as unicode in my ORM (SQLAlchemy) I get different results for the same text. So they're stored differently; however, I work with unicode throughout my entire system. So I figured either I'd have to update the fields in the databases with a query or correct the values by checking how they were encoded/decoded. – JayDL Jan 26 '12 at 12:59
  • If "collation" (i.e. string ordering) is your actual concern, then all we've been saying so far on encoding is totally OT (or tangential at best). – Alien Life Form Jan 30 '12 at 16:31