I have a problem. Unicode U+2019 is this character: ’
It is a right single quotation mark. It gets encoded as UTF-8, but I fear it is getting double-encoded.
>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'
>>> u'\xe2\x80\x99'.encode('utf-8')
'\xc3\xa2\xc2\x80\xc2\x99'
>>> u'\xc3\xa2\xc2\x80\xc2\x99'.encode('utf-8')
'\xc3\x83\xc2\xa2\xc3\x82\xc2\x80\xc3\x82\xc2\x99'
>>> print(u'\u2019')
’
>>> print('\xe2\x80\x99')
’
>>> print('\xc3\xa2\xc2\x80\xc2\x99')
â€™
>>> '\xc3\xa2\xc2\x80\xc2\x99'.decode('utf-8')
u'\xe2\x80\x99'
>>> '\xe2\x80\x99'.decode('utf-8')
u'\u2019'
This is the principle used above.
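For anyone following along in Python 3, where the session above will not run as written, here is a minimal sketch of the same double encoding. It assumes the accidental step is equivalent to an ISO-8859-1 (latin-1) decode, which is what the u'\xe2\x80\x99' literal amounts to. The variable names are only for illustration.

```python
# Python 3 sketch: text is str, encoded data is bytes.
original = "\u2019"                       # RIGHT SINGLE QUOTATION MARK

# First (correct) encoding: one character becomes three bytes.
once = original.encode("utf-8")
print(once)                               # b'\xe2\x80\x99'

# The mistake: each byte is treated as a character (assumed here to be
# a latin-1 decode), and the result is encoded as UTF-8 a second time.
twice = once.decode("latin-1").encode("utf-8")
print(twice)                              # b'\xc3\xa2\xc2\x80\xc2\x99'

# Repeating the mistake grows the data again.
thrice = twice.decode("latin-1").encode("utf-8")
print(len(once), len(twice), len(thrice)) # 3 6 12
```

The byte values match the three encode calls in the Python 2 session above.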
How can I do the decode steps (the last two commands above) in C#?
How can I take a UTF-8-encoded string, convert it to a byte array, convert THAT to a string, and then decode again?
I tried the method below, but it seems the output is not suitable as ISO-8859-1...
string firstLevel = "’";
byte[] decodedBytes = Encoding.UTF8.GetBytes(firstLevel);
Console.WriteLine(Encoding.UTF8.GetChars(decodedBytes));
// ’
Console.WriteLine(decodeUTF8String(firstLevel));
//â�,��"�
//I was hoping for this:
//’
Understanding Update:
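Jon has helped me with my most basic question: going from the doubly encoded text (bytes '\xc3\xa2\xc2\x80\xc2\x99') to the singly encoded text (bytes '\xe2\x80\x99') and thence to "’". But I want to honor the recommendations at the heart of his answer: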
- understand what is happening
- fix the original sin
I made an effort at number 1.
Encoding/Decoding
I get so confused by terms like these. I confuse them with Encrypting/Decrypting, simply because of the "En..." and "De..." prefixes, and I forget what they translate from and what they translate to. Part of the confusion may come from other terms I only vaguely grasp, like hex, character entities, code points, and character maps.
I wanted to settle the definitions at a basic level. In the context of this question, Encoding and Decoding are:
- Decode
  - Corresponds to C# {Encoding}.GetString(bytesArray)
  - Corresponds to Python stringObject.decode({Encoding})
  - Takes bytes as input and converts them to a string as output, according to some conversion scheme called an "encoding", represented by {Encoding} above.
  - Bytes -> String
- Encode
  - Corresponds to C# {Encoding}.GetBytes(stringObject)
  - Corresponds to Python stringObject.encode({Encoding})
  - The reverse of Decode.
  - String -> Bytes (except in Python 2, where the result is a str that doubles as bytes)
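To keep the two directions straight, here is a small Python 3 sketch of the definitions above. Python 3 is my choice here because it keeps str and bytes as separate types. The variable names are illustrative.

```python
text = "\u2019"                          # a string: one character

# Encode: String -> Bytes
data = text.encode("utf-8")
print(type(data).__name__, len(data))    # bytes 3

# Decode: Bytes -> String
back = data.decode("utf-8")
print(type(back).__name__, len(back))    # str 1

# With the same encoding in both directions, the round trip is lossless.
print(back == text)                      # True
```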
Bytes vs Strings in Python
So Encode and Decode take us back and forth between bytes and strings.
While Python helped me understand what was going wrong, it could also confuse my understanding of the "fundamentals" of Encoding/Decoding. Jon said:
It's a shame that Python hides [the difference between binary data and text data] to a large extent
I think this is what the PEP means when it says:
Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs.
Python 3.* does not overload strings in this way:
Python 2.7
>>> #Encoding example. As a generalization, "Encoding" produces bytes.
>>> #In Python 2.7, strings are overloaded to serve as bytes
>>> type(u'\u2019'.encode('utf-8'))
<type 'str'>
Python 3.*
>>> #In Python 3.*, bytes and strings are distinct
>>> type('\u2019'.encode('utf-8'))
<class 'bytes'>
Another important (related) difference between Python 2 and 3 is their default encoding:
>>> import sys
>>> sys.getdefaultencoding()
Python 2
'ascii'
Python 3
'utf-8'
And while Python 2 says 'ascii', I think it means one specific thing by that:
- It does not mean ISO-8859-1, which supports range(256) and is what Jon uses to decode (discussed below)
- It means ASCII, the plainest variety, which covers only range(128)
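The difference between the two ranges can be checked directly. This is a Python 3 sketch using the standard "ascii" and "latin-1" codec names. The flag variable is only for illustration.

```python
data = "\u2019".encode("utf-8")   # b'\xe2\x80\x99': every byte is >= 0x80

# Plain ASCII covers only byte values 0-127, so these bytes are rejected.
try:
    data.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)                   # False

# ISO-8859-1 accepts all 256 byte values, one character per byte.
latin1_text = data.decode("latin-1")
print(len(latin1_text))           # 3

# On the shared range(128) the two encodings agree.
low = bytes(range(128))
print(low.decode("ascii") == low.decode("latin-1"))   # True
```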
And while Python 3 no longer overloads strings as both bytes and text, the interpreter still makes it easy to ignore what's happening and move between types, e.g.:
- just put a 'u' before a string in Python 2.* and it's a Unicode literal
- just put a 'b' before a string in Python 3.* and it's a Bytes literal
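A short Python 3 sketch of the two literal prefixes. It assumes Python 3.3 or later, where the 'u' prefix is accepted again but changes nothing. The `mixed` flag is only for illustration.

```python
text = u"\u2019"            # 'u' is redundant in Python 3: this is a plain str
data = b"\xe2\x80\x99"      # 'b' makes a bytes literal
print(type(text).__name__, type(data).__name__)   # str bytes

# Unlike Python 2, Python 3 refuses to mix the two types implicitly.
try:
    text + data
    mixed = True
except TypeError:
    mixed = False
print(mixed)                # False

# Moving between them requires naming an encoding explicitly.
print(data.decode("utf-8") == text)               # True
```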
Encoding and C#
Jon points out that C# uses UTF-16, correcting my "UTF-8 Encoded String" comment above:
Every string is effectively UTF-16.
My understanding of this is: if C# has a string object "s", the computer memory actually holds the bytes corresponding to that character in the UTF-16 map. That is (including a byte-order mark??), feff0073.
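To check the byte values, here is a Python 3 sketch of what UTF-16 does with "s". It only illustrates the encoding itself, not how the .NET runtime lays out memory. As far as I understand, the byte-order mark belongs to encoded files and streams rather than to an in-memory string, and the byte order in memory depends on the platform.

```python
s = "s"

# UTF-16 uses 16-bit code units; the byte order is a separate choice.
print(s.encode("utf-16-be").hex())   # 0073
print(s.encode("utf-16-le").hex())   # 7300

# The generic "utf-16" codec prepends a byte-order mark in native order.
with_bom = s.encode("utf-16")
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))   # True

# U+2019 takes one 16-bit unit in UTF-16 but three bytes in UTF-8.
quote = "\u2019"
print(len(quote.encode("utf-16-le")), len(quote.encode("utf-8")))   # 2 3
```

So "feff0073" would be the big-endian encoding of "s" with a byte-order mark in front, which is one possible serialized form.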
He also uses ISO-8859-1 in the hack method I requested. I'm not sure why. My head is hurting at the moment, so I'll return when I have some perspective.
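My best guess at why ISO-8859-1 fits, sketched in Python 3: it maps code points 0-255 to the identical byte values, so it can turn the three wrongly decoded characters back into the three original bytes. This assumes the double encoding happened the way the Python 2 session above shows. It is not a transcription of Jon's C# code.

```python
doubled = b"\xc3\xa2\xc2\x80\xc2\x99"

# Step 1: undo the outer UTF-8 layer. The result is three characters
# whose code points equal the original byte values.
step1 = doubled.decode("utf-8")
print([hex(ord(c)) for c in step1])   # ['0xe2', '0x80', '0x99']

# Step 2: ISO-8859-1 turns each of those code points into the same byte.
step2 = step1.encode("latin-1")
print(step2)                          # b'\xe2\x80\x99'

# Step 3: decode the recovered bytes as the UTF-8 they always were.
fixed = step2.decode("utf-8")
print(fixed == "\u2019")              # True

# The property that makes step 2 safe: every byte value round-trips.
all_bytes = bytes(range(256))
print(all_bytes.decode("latin-1").encode("latin-1") == all_bytes)   # True
```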
I'll return to this post. I hope I'm explaining properly. I'll make it a Wiki?