I have a string that displays UTF-8 encoded characters, and I want to convert it back to Unicode.

For now, my implementation is the following:

public static string DecodeFromUtf8(this string utf8String)
{
    // read the string as UTF-8 bytes.
    byte[] encodedBytes = Encoding.UTF8.GetBytes(utf8String);

    // convert them into unicode bytes.
    byte[] unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);

    // builds the converted string.
    return Encoding.Unicode.GetString(encodedBytes);
}

I am playing with the word "déjà". I converted it into UTF-8 through this online tool, and then started testing my method with the string "dÃ©jÃ ".

Unfortunately, with this implementation the string just remains the same.

Where am I wrong?

remio
  • That's not a UTF8 string. That's a corrupted string that has been badly converted from bytes using the wrong encoding. – spender Jul 02 '12 at 12:49
  • UTF-8 *is* Unicode. – Alexey Frunze Jul 02 '12 at 12:49
  • The source string is invalid UTF-8. – alexn Jul 02 '12 at 12:49
  • C# strings have 16-bit characters, so they can't possibly be UTF-8 encoded. I think the system doesn't understand what you're trying to do. Where do you get the miscoded strings from? – Mr Lister Jul 02 '12 at 12:50
  • The function must accept `byte[]` in the first place, not `string`. – GSerg Jul 02 '12 at 12:51
  • @AlexeyFrunze and richard: If it helps, read "UTF-16" for "Unicode" in the question. C#'s native string encoding is UTF-16, and it is called Unicode in the docs. – Mr Lister Jul 02 '12 at 12:51
  • @MrLister Oh, so we have a case of confusing terminology. – Alexey Frunze Jul 02 '12 at 12:54
  • As this web page is in utf-8 I am looking at the utf-8 for déjà and it looks like déjÃ. – ctrl-alt-delor Jul 02 '12 at 12:57
  • @spender, can you be more specific, please? How can you see my UTF-8 string is corrupted? (Also, I updated my question to show where I got it). – remio Jul 02 '12 at 12:57
  • You might want to start with [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/) to understand what you're trying to do... – deceze Jul 02 '12 at 12:58
  • @MrLister: C# strings use 16-bit **code units**. Unicode characters are 21 bits, of course. – Joey Jul 02 '12 at 13:11
  • @Joey Again, confusing terminology. The basic unit of a string is called a `char` in C# (or a `Char` in .NET lingo) and they're 16 bits. But there is no such thing as a 21-bit Unicode character. At least the phrase "21 bit character" does not appear anywhere on the Unicode site, and no implementation in the world has 21 bits. (By the way, I proposed a 24-bit encoding once (UTF-24), but that was declined.) – Mr Lister Jul 02 '12 at 14:22
  • @AlexeyFrunze utf is unicode ?????? utf-8 is a way to _store_ unicode code points – Royi Namir Apr 28 '13 at 08:37

4 Answers

So the issue is that UTF-8 code unit values have been stored as a sequence of 16-bit code units in a C# string. You simply need to verify that each code unit is within the range of a byte, copy those values into bytes, and then convert the new UTF-8 byte sequence into UTF-16.

public static string DecodeFromUtf8(this string utf8String)
{
    // copy the string as UTF-8 bytes.
    byte[] utf8Bytes = new byte[utf8String.Length];
    for (int i=0;i<utf8String.Length;++i) {
        //Debug.Assert( 0 <= utf8String[i] && utf8String[i] <= 255, "the char must be in byte's range");
        utf8Bytes[i] = (byte)utf8String[i];
    }

    return Encoding.UTF8.GetString(utf8Bytes,0,utf8Bytes.Length);
}

DecodeFromUtf8("d\u00C3\u00A9j\u00C3\u00A0"); // déjà

This is easy. However, it would be best to find the root cause: the place where someone is copying UTF-8 code units into 16-bit code units. The likely culprit is somebody converting bytes into a C# string using the wrong encoding, e.g. `Encoding.Default.GetString(utf8Bytes, 0, utf8Bytes.Length)`.
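
To make the root cause concrete, here is a minimal sketch that reproduces the mistake and then undoes it. It assumes the wrong encoding was ISO-8859-1 (Latin-1), which maps every byte to the code point with the same value; the class and method names (`MojibakeDemo`, `Mangle`, `Repair`) are made up for illustration.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: the "wrong" encoding is ISO-8859-1, which maps each byte
    // 0x00-0xFF to the code point with the same numeric value.
    private static readonly Encoding Latin1 = Encoding.GetEncoding("iso-8859-1");

    // Reproduces the mistake: UTF-8 bytes decoded as if they were Latin-1.
    public static string Mangle(string text)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        return Latin1.GetString(utf8Bytes);
    }

    // Undoes the mistake: recover the original bytes, then decode them as UTF-8.
    public static string Repair(string mangled)
    {
        byte[] utf8Bytes = Latin1.GetBytes(mangled);
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    public static void Main()
    {
        // "d\u00E9j\u00E0" is "déjà", written with escapes so the result does
        // not depend on the encoding of the source file.
        string mangled = Mangle("d\u00E9j\u00E0");
        Console.WriteLine(mangled == "d\u00C3\u00A9j\u00C3\u00A0"); // True
        Console.WriteLine(Repair(mangled) == "d\u00E9j\u00E0");     // True
    }
}
```

The repair only works because Latin-1 is a single-byte encoding that loses nothing in either direction; with a lossy wrong encoding the original bytes may not be recoverable.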


Alternatively, if you're sure you know the incorrect encoding which was used to produce the string, and that incorrect encoding transformation was lossless (usually the case if the incorrect encoding is a single byte encoding), then you can simply do the inverse encoding step to get the original UTF-8 data, and then you can do the correct conversion from UTF-8 bytes:

public static string UndoEncodingMistake(string mangledString, Encoding mistake, Encoding correction)
{
    // the inverse of `mistake.GetString(originalBytes);`
    byte[] originalBytes = mistake.GetBytes(mangledString);
    return correction.GetString(originalBytes);
}

UndoEncodingMistake("d\u00C3\u00A9j\u00C3\u00A0", Encoding.GetEncoding(1252), Encoding.UTF8); // déjà
bames53
  • Thanks bames53, this exactly answers my question, as it produces the result I expect. You managed to work out what I meant despite my confusing question. – remio Jul 03 '12 at 07:42

> I have a string that displays UTF-8 encoded characters

There is no such thing in .NET. The string class can only store strings in UTF-16 encoding. A UTF-8 encoded string can only exist as a byte[]. Trying to store bytes into a string will not come to a good end; UTF-8 uses byte values that don't have a valid Unicode codepoint. The content will be destroyed when the string is normalized. So it is already too late to recover the string by the time your DecodeFromUtf8() starts running.

Only handle UTF-8 encoded text with byte[]. And use UTF8Encoding.GetString() to convert it.
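
As a rough illustration of that advice, the sketch below keeps the UTF-8 data as bytes until the moment it is decoded, either from a buffer or from a stream. The class and method names (`Utf8BytesDemo`, `Decode`, `ReadAll`) are hypothetical, and the byte values are just the UTF-8 encoding of "déjà".

```csharp
using System;
using System.IO;
using System.Text;

public static class Utf8BytesDemo
{
    // Decodes a buffer that is known to contain UTF-8 encoded text.
    public static string Decode(byte[] utf8Bytes)
    {
        return Encoding.UTF8.GetString(utf8Bytes);
    }

    // Reads all text from a stream, telling the reader explicitly that it is UTF-8.
    public static string ReadAll(Stream utf8Stream)
    {
        using (var reader = new StreamReader(utf8Stream, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // UTF-8 encoding of "déjà": 'é' is C3 A9 and 'à' is C3 A0.
        byte[] utf8Bytes = { 0x64, 0xC3, 0xA9, 0x6A, 0xC3, 0xA0 };

        Console.WriteLine(Decode(utf8Bytes) == "d\u00E9j\u00E0");                    // True
        Console.WriteLine(ReadAll(new MemoryStream(utf8Bytes)) == "d\u00E9j\u00E0"); // True
    }
}
```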

Peter Mortensen
Hans Passant
  • You pointed out the confusion I wanted to avoid. My string is a Unicode string, or rather a .NET string, that the debugger displays as `dÃ©jÃ`. Hence, my goal is to get another (.NET) string that will be displayed as `déjà` (in the debugger, for instance). – remio Jul 02 '12 at 13:25
  • You are missing the point of the answer: there is no way to make this work properly for *every* possible utf-8 encoded string. That you could make it work for déjà is merely coincidence. That you are already having trouble with it should be one hint: there's an extra space after the last Ã. A special one, a non-breaking space, code point U+00A0. Which happens to be a valid Unicode code point by accident. – Hans Passant Jul 02 '12 at 13:36
  • Thanks, I think I get it. You mean that I just can't use `string` to store the UTF-8 bytes. However, as you mention it could work by accident, it would be a great help if I could make the accidents work. In other words, I still don't know how to make this conversion in the cases where it would work. – remio Jul 02 '12 at 14:29
  • You can try your luck by using Encoding.Default.GetBytes() to try to recover the byte[]. I would strongly recommend the System.Random class instead, it has a more predictable outcome. – Hans Passant Jul 02 '12 at 14:35
  • I finally found something that seems to work. First I get a `byte[]` from this infamous UTF-8 string. In this array, I noticed that all the odd indexes contain `0`, so I removed all of them and invoked `unicodeBytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, encodedBytes);` on this result. At the end, I returned `Encoding.Unicode.GetString(unicodeBytes);`. Then, I picked loads of text samples in many languages (thanks Wikipedia), built a big big string, converted it into my infamous UTF-8 format, then decoded it and got the exact same original string. No random, no accident. – remio Jul 02 '12 at 14:56
  • If the string contains zeros at odd-numbered indices then it actually contains utf-16 encoded bytes, not utf-8. – Hans Passant Jul 02 '12 at 16:50

If you have a UTF-8 string in which every byte was stored unchanged in its own 16-bit char ('Ö' -> [195, 0], [150, 0]), you can use the following:

public static string Utf8ToUtf16(string utf8String)
{
    /***************************************************************
     * Every .NET string will store text with the UTF-16 encoding, *
     * known as Encoding.Unicode. Other encodings may exist as     *
     * Byte-Array or incorrectly stored with the UTF-16 encoding.  *
     *                                                             *
     * UTF-8 = 1 to 4 bytes per char                               *
     *    ["100" for the ansi 'd']                                 *
     *    ["206" and "186" for the Greek 'κ']                      *
     *                                                             *
     * UTF-16 = 2 or 4 bytes per char                              *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["186, 3" for the Greek 'κ']                             *
     *                                                             *
     * UTF-8 inside UTF-16                                         *
     *    ["100, 0" for the ansi 'd']                              *
     *    ["206, 0" and "186, 0" for the Greek 'κ']                *
     *                                                             *
     * First we need to get the UTF-8 Byte-Array and remove all    *
     * 0 byte (binary 0) while doing so.                           *
     *                                                             *
     * Binary 0 means end of string on UTF-8 encoding while on     *
     * UTF-16 one binary 0 does not end the string. Only if there  *
     * are 2 binary 0, then the UTF-16 encoding will end the       *
     * string. Because of .NET we don't have to handle this.       *
     *                                                             *
     * After removing binary 0 and receiving the Byte-Array, we    *
     * can use the UTF-8 encoding to string method now to get a    *
     * UTF-16 string.                                              *
     *                                                             *
     ***************************************************************/

    // Get UTF-8 bytes and remove binary 0 bytes (filler)
    List<byte> utf8Bytes = new List<byte>(utf8String.Length);
    foreach (byte utf8Byte in utf8String)
    {
        // Remove binary 0 bytes (filler)
        if (utf8Byte > 0) {
            utf8Bytes.Add(utf8Byte);
        }
    }

    // Convert UTF-8 bytes to UTF-16 string
    return Encoding.UTF8.GetString(utf8Bytes.ToArray());
}

In my case the DLL result is a UTF-8 string too, but unfortunately its bytes were decoded with the ANSI code page instead of being copied one-to-one ('Ö' -> [195, 0], [19, 32]). So the byte 150, which is '–' in ANSI, was converted to the UTF-16 '–', which is 8211 (U+2013). If you have this case too, you can use the following instead:

public static string Utf8ToUtf16(string utf8String)
{
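    // Get UTF-8 bytes by re-encoding each char with the ANSI code page.
    // Note: Encoding.Default is the ANSI code page only on .NET Framework;
    // on .NET Core and later it is always UTF-8.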
    byte[] utf8Bytes = Encoding.Default.GetBytes(utf8String);

    // Convert UTF-8 bytes to UTF-16 bytes
    byte[] utf16Bytes = Encoding.Convert(Encoding.UTF8, Encoding.Unicode, utf8Bytes);

    // Return UTF-16 bytes as UTF-16 string
    return Encoding.Unicode.GetString(utf16Bytes);
}
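
The difference between the two cases can be checked directly. The sketch below assumes the string was produced by decoding UTF-8 bytes as Windows-1252 and that the code runs on .NET Core or .NET 5 and later, where code page 1252 has to be registered through `CodePagesEncodingProvider` before `Encoding.GetEncoding(1252)` succeeds. `Cp1252Demo`, `GetWindows1252` and `FitsInBytes` are illustrative names only.

```csharp
using System;
using System.Text;

public static class Cp1252Demo
{
    // Assumption: modern .NET, where code page encodings must be registered
    // before Encoding.GetEncoding(1252) can find them.
    public static Encoding GetWindows1252()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(1252);
    }

    // True if every UTF-16 code unit of the string fits into a single byte,
    // i.e. the simple "cast each char to byte" approach would be safe.
    public static bool FitsInBytes(string s)
    {
        foreach (char c in s)
        {
            if (c > 0xFF) return false;
        }
        return true;
    }

    public static void Main()
    {
        Encoding cp1252 = GetWindows1252();
        byte[] utf8Bytes = Encoding.UTF8.GetBytes("\u00D6"); // 'Ö' -> 195, 150

        // Decoding with the wrong (ANSI) code page turns byte 150 into U+2013.
        string mangled = cp1252.GetString(utf8Bytes);
        Console.WriteLine((int)mangled[0]);      // 195
        Console.WriteLine((int)mangled[1]);      // 8211
        Console.WriteLine(FitsInBytes(mangled)); // False

        // Re-encoding with the same code page recovers the original bytes.
        string repaired = Encoding.UTF8.GetString(cp1252.GetBytes(mangled));
        Console.WriteLine(repaired == "\u00D6"); // True
    }
}
```

When `FitsInBytes` returns false, the first variant above (dropping zero bytes) cannot recover the data, and re-encoding with the code page that caused the damage is the option left.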

Or use the native method:

[DllImport("kernel32.dll")]
private static extern Int32 MultiByteToWideChar(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPStr)] String lpMultiByteStr, Int32 cbMultiByte, [Out, MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpWideCharStr, Int32 cchWideChar);

public static string Utf8ToUtf16(string utf8String)
{
    Int32 iNewDataLen = MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, null, 0);
    if (iNewDataLen > 1)
    {
        StringBuilder utf16String = new StringBuilder(iNewDataLen);
        MultiByteToWideChar(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf8String, -1, utf16String, utf16String.Capacity);

        return utf16String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf16ToUtf8. Hope I could be of help.

Community
MEN
  • Just to be sure: The string after converting will still be UTF-16, it just contains UTF-8 encoding data. You can't handle strings using the UTF-8 encoding, because .NET will always use the UTF-16 encoding to handle strings. – MEN Jul 16 '13 at 08:08

What you have seems to be a string incorrectly decoded from another encoding, likely code page 1252, which is the US Windows default. Here's how to reverse it, assuming no other loss. One thing not immediately apparent is the non-breaking space (U+00A0) at the end of your string, which is not displayed. Of course it would be better to read the data source correctly in the first place, but perhaps the data source was stored incorrectly to begin with.

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        string junk = "dÃ©jÃ\xa0";  // Bad Unicode string

        // Turn string back to bytes using the original, incorrect encoding.
        byte[] bytes = Encoding.GetEncoding(1252).GetBytes(junk);

        // Use the correct encoding this time to convert back to a string.
        string good = Encoding.UTF8.GetString(bytes);
        Console.WriteLine(good);
    }
}

Result:

déjà
Mark Tolonen