
I have a problem. Unicode code point U+2019 is this character: ’

It is a right single quotation mark. It gets encoded as UTF-8, but I fear it is getting double-encoded.

>>> u'\u2019'.encode('utf-8')
'\xe2\x80\x99'
>>> u'\xe2\x80\x99'.encode('utf-8')
'\xc3\xa2\xc2\x80\xc2\x99'
>>> u'\xc3\xa2\xc2\x80\xc2\x99'.encode('utf-8')
'\xc3\x83\xc2\xa2\xc3\x82\xc2\x80\xc3\x82\xc2\x99'
>>> print(u'\u2019')
’
>>> print('\xe2\x80\x99')
’
>>> print('\xc3\xa2\xc2\x80\xc2\x99')
â€™
>>> '\xc3\xa2\xc2\x80\xc2\x99'.decode('utf-8')
u'\xe2\x80\x99'
>>> '\xe2\x80\x99'.decode('utf-8')
u'\u2019'

This is the principle used above.
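The same round trip can be sketched in Python 3, where bytes and text are distinct types (the variable names are my own):

```python
# Python 3 sketch of the same round trip: one clean encode, then an
# accidental double-encode, then the two decodes that undo it.
s = '\u2019'                          # RIGHT SINGLE QUOTATION MARK
once = s.encode('utf-8')
assert once == b'\xe2\x80\x99'

# Double-encoding: the UTF-8 bytes are mistaken for Latin-1 text
# and encoded to UTF-8 a second time.
twice = once.decode('latin-1').encode('utf-8')
assert twice == b'\xc3\xa2\xc2\x80\xc2\x99'

# Undoing it: decode UTF-8, treat the result as Latin-1 bytes, decode again.
assert twice.decode('utf-8').encode('latin-1').decode('utf-8') == s
```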

How can I do the bolded parts (the two decode calls at the end) in C#?

How can I take a UTF-8-encoded string, convert it to a byte array, convert that back to a string, and then decode it again?

I tried this method, but the output is not right; ISO-8859-1 does not seem suitable...

    string firstLevel = "â€™";
    byte[] decodedBytes = Encoding.UTF8.GetBytes(firstLevel);

    Console.WriteLine(Encoding.UTF8.GetChars(decodedBytes));
    // â€™

    Console.WriteLine(decodeUTF8String(firstLevel));
    //â�,��"�
    //I was hoping for this:
    //’

Understanding Update:

Jon's helped me with my most basic question: going from the doubly encoded string to "â€™" and thence to "’". But I want to honor the recommendations at the heart of his answer:

  1. understand what is happening
  2. fix the original sin

I made an effort at number 1.

Encoding/Decoding

I get so confused with terms like these. I confuse them with terms like Encrypting/Decrypting, simply because of "En..." and "De..." I forget what they translate from and what they translate to; I confuse these start points and end points. It could be related to other vague terms like hex, character entities, code points, and character maps.

I wanted to settle the definition at a basic level. Encoding and Decoding in the context of this question is:

  1. Decode
    • Corresponds to C# {Encoding}.GetString(bytesArray)
    • Corresponds to Python stringObject.decode({Encoding})
    • Takes bytes as input and converts them to a string representation as output, according to some conversion scheme called an "encoding", represented by {Encoding} above.
    • Bytes -> String
  2. Encode
    • Corresponds to C# {Encoding}.GetBytes(stringObject)
    • Corresponds to Python stringObject.encode({Encoding})
    • The reverse of Decode.
    • String -> Bytes (except in Python 2, where a byte string can also be "encoded")
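In Python 3 terms, the two directions in the list above can be checked directly (a minimal sketch):

```python
# Decode: bytes -> str; the counterpart of {Encoding}.GetString(bytesArray)
data = b'\xe2\x80\x99'
text = data.decode('utf-8')
assert text == '\u2019'

# Encode: str -> bytes; the counterpart of {Encoding}.GetBytes(stringObject)
assert text.encode('utf-8') == data
```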

Bytes vs Strings in Python

So Encode and Decode take us back and forth between bytes and strings.

While Python helped me understand what was going wrong, it could also confuse my understanding of the "fundamentals" of Encoding/Decoding. Jon said:

It's a shame that Python hides [the difference between binary data and text data] to a large extent

I think this is what the PEP means when it says:

Python's current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs.

Python 3.* does not overload strings in this way:

Python 2.7

>>> #Encoding example. As a generalization, "encoding" produces bytes.
>>> #In Python 2.7, strings are overloaded to serve as bytes
>>> type(u'\u2019'.encode('utf-8'))
<type 'str'>

Python 3.*

>>> #In Python 3.*, bytes and strings are distinct
>>> type('\u2019'.encode('utf-8'))
<class 'bytes'>

Another important (related) difference between Python 2 and 3 is their default encoding:

>>> import sys
>>> sys.getdefaultencoding()

Python 2

'ascii'

Python 3

'utf-8'

And while Python 2 says 'ascii', it means plain ASCII specifically:

  • It does not mean ISO-8859-1, which supports range(256) and which is what Jon uses to decode (discussed below)
  • It means ASCII, the plainest variety, which covers only range(128)
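The difference between the two ranges is easy to demonstrate; this sketch assumes Python 3:

```python
# ASCII covers code points 0-127; ISO-8859-1 (Latin-1) covers 0-255,
# mapping byte value n straight to code point U+00n.
assert bytes(range(128)).decode('ascii') == ''.join(chr(n) for n in range(128))

try:
    b'\xe2'.decode('ascii')          # 0xE2 is outside range(128)
except UnicodeDecodeError:
    ascii_failed = True
assert ascii_failed

assert b'\xe2'.decode('latin-1') == '\u00e2'   # Latin-1 accepts all 256 values
assert bytes(range(256)).decode('latin-1') == ''.join(chr(n) for n in range(256))
```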

And while Python 3 no longer overloads strings as both bytes and strings, the interpreter still makes it easy to ignore what's happening and move between types, i.e.:

  • just put a 'u' before a string literal in Python 2.* and it's a Unicode literal
  • just put a 'b' before a string literal in Python 3.* and it's a bytes literal
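For example (the 'u' prefix is also accepted in Python 3, which makes the crossover even easier to miss):

```python
# 'u' prefix: a text (Unicode) literal; legal in both Python 2 and Python 3.
assert u'\u2019' == '\u2019'
assert isinstance(u'\u2019', str)

# 'b' prefix: a bytes literal; in Python 3 it is a distinct type from str.
assert isinstance(b'\xe2\x80\x99', bytes)
assert b'\xe2\x80\x99' != '\u2019'   # bytes never compare equal to str
```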

Encoding and C#

Jon points out that C# uses UTF-16, correcting my "UTF-8-encoded string" comment above:

Every string is effectively UTF-16.

My understanding of that is: if C# has a string object "s", the computer's memory actually holds bytes corresponding to that character in the UTF-16 map. That is (including the byte-order mark??), feff0073.
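A side check from Python (not C#): "s" is U+0073, and the byte-order mark U+FEFF serializes as FE FF in big-endian UTF-16, which is where the feff0073 reading comes from:

```python
import codecs

# "s" is U+0073; each UTF-16 code unit is two bytes.
assert 's'.encode('utf-16-be') == b'\x00\x73'   # big-endian: 00 73
assert 's'.encode('utf-16-le') == b'\x73\x00'   # little-endian: 73 00

# The byte-order mark is U+FEFF, serialized as FE FF in big-endian order.
assert codecs.BOM_UTF16_BE == b'\xfe\xff'
assert '\ufeffs'.encode('utf-16-be') == b'\xfe\xff\x00\x73'
```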

He also uses ISO-8859-1 in the hack method I requested. I'm not sure why. My head is hurting at the moment, so I'll return when I have some perspective.

I'll return to this post. I hope I'm explaining properly. I'll make it a Wiki?

Nate Anderson
    There's no such thing as a "UTF-8-encoded string" in .NET. *Every* string is effectively UTF-16. If you *possibly* can, you should fix the code that's behaving badly (encoding binary data as if it were text). – Jon Skeet Jul 26 '13 at 22:55
  • Thanks. I need a deeper understanding of encoding. I'll re-read Spolsky et al. Is it possible to go from "â€™" to "’" in C#? Unfortunately, I can't fix the original sin. – Nate Anderson Jul 26 '13 at 23:01

1 Answer


You need to understand that fundamentally this is due to someone misunderstanding the difference between binary data and text data. It's a shame that Python hides that difference to a large extent - it's quite hard to accidentally perform this particular form of double-encoding in C#. Still, this code should work for you:

using System;
using System.Text;

class Test
{
    static void Main()
    {
        // Avoid encoding issues in the source file itself...
        string firstLevel = "\u00c3\u00a2\u00c2\u0080\u00c2\u0099";
        string secondLevel = HackDecode(firstLevel);
        string thirdLevel = HackDecode(secondLevel);
        Console.WriteLine("{0:x}", (int) thirdLevel[0]); // 2019
    }

    // Converts a string to a byte array using ISO-8859-1, then *decodes*
    // it using UTF-8. Any use of this method indicates broken data to start
    // with. Ideally, the source of the error should be fixed.
    static string HackDecode(string input)
    {
        byte[] bytes = Encoding.GetEncoding(28591)
                               .GetBytes(input);
        return Encoding.UTF8.GetString(bytes);
    }
}
Jon Skeet
  • Thank you, Jon. I am both trying to understand the fundamentals, and empowered to find the original sin. I'm not understanding one part in particular about your answer: Why is ISO-8859-1 relevant? Why is it used to get Bytes from a String? I thought "UTF-16" might be relevant given your point: "Every string is effectively UTF-16." And is the answer related to this post: (I realize its OP does something similar http://stackoverflow.com/q/1922199/1175496) – Nate Anderson Jul 28 '13 at 00:44
  • @TheRedPea: ISO-8859-1 is an encoding which converts the first 256 Unicode characters into a single byte of the same value - which is what's basically going on here. The "double-encoding" is to encode a string as UTF-8, and then treat the results (bytes) *also* as a string, but in ISO-8859-1. – Jon Skeet Jul 28 '13 at 06:18
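Jon's comment is the whole trick: ISO-8859-1 maps the first 256 code points to single bytes of the same value. A Python 3 sketch of his HackDecode (my own port, not his code):

```python
def hack_decode(text: str) -> str:
    """Encode with Latin-1 (code point n -> byte n), then decode as UTF-8.
    Any need for this indicates broken data to start with."""
    return text.encode('latin-1').decode('utf-8')

# Jon's firstLevel string, written with escapes to avoid source-encoding issues.
first_level = '\u00c3\u00a2\u00c2\u0080\u00c2\u0099'
second_level = hack_decode(first_level)   # one layer of double-encoding undone
third_level = hack_decode(second_level)   # back to the real character
assert second_level == '\u00e2\u0080\u0099'
assert third_level == '\u2019'
```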