19

I need to convert a string to UTF-8 in C#. I've already try many ways but none works as I wanted. I converted my string into a byte array and then to try to write it to an XML file (which encoding is UTF-8....) but either I got the same string (not encoded at all) either I got a list of byte which is useless.... Does someone face the same issue ?

Edit : This is some of the code I used :

str= "testé";
byte[] utf8Bytes = Encoding.UTF8.GetBytes(str);
return Encoding.UTF8.GetString(utf8Bytes);

The result is "testé" or I expected something like "testé"...

Celero
  • 445
  • 2
  • 6
  • 19
  • Your existing code would expain your problem better, and, what are you expecting to get, if not a list of bytes or a readable string? Surely in XML a readable string is exactly what you want? – Jodrell Jun 01 '11 at 09:12
  • Also, what do you mean when you say you "got the same string not encoded at all"? If you take a UTF-16 string, and save it to a UTF-8-encoded XML file, and then open the XML file in a text editor, you will see "the same string". You would only notice a difference if you open the file using a hex editor. – phoog Jun 01 '11 at 09:13
  • Does this answer your question? [UTF-16 to UTF-8 conversion (for scripting in Windows)](https://stackoverflow.com/questions/265370/utf-16-to-utf-8-conversion-for-scripting-in-windows) – Ian Kemp Jul 23 '21 at 11:44

6 Answers6

20

If you want a UTF8 string, where every byte is correct ('Ö' -> [195, 0] , [150, 0]), you can use the followed:

public static string Utf16ToUtf8(string utf16String)
{
   /**************************************************************
    * Every .NET string will store text with the UTF16 encoding, *
    * known as Encoding.Unicode. Other encodings may exist as    *
    * Byte-Array or incorrectly stored with the UTF16 encoding.  *
    *                                                            *
    * UTF8 = 1 bytes per char                                    *
    *    ["100" for the ansi 'd']                                *
    *    ["206" and "186" for the russian 'κ']                   *
    *                                                            *
    * UTF16 = 2 bytes per char                                   *
    *    ["100, 0" for the ansi 'd']                             *
    *    ["186, 3" for the russian 'κ']                          *
    *                                                            *
    * UTF8 inside UTF16                                          *
    *    ["100, 0" for the ansi 'd']                             *
    *    ["206, 0" and "186, 0" for the russian 'κ']             *
    *                                                            *
    * We can use the convert encoding function to convert an     *
    * UTF16 Byte-Array to an UTF8 Byte-Array. When we use UTF8   *
    * encoding to string method now, we will get a UTF16 string. *
    *                                                            *
    * So we imitate UTF16 by filling the second byte of a char   *
    * with a 0 byte (binary 0) while creating the string.        *
    **************************************************************/

    // Storage for the UTF8 string
    string utf8String = String.Empty;

    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Fill UTF8 bytes inside UTF8 string
    for (int i = 0; i < utf8Bytes.Length; i++)
    {
        // Because char always saves 2 bytes, fill char with 0
        byte[] utf8Container = new byte[2] { utf8Bytes[i], 0 };
        utf8String += BitConverter.ToChar(utf8Container, 0);
    }

    // Return UTF8
    return utf8String;
}

In my case the DLL request is a UTF8 string too, but unfortunately the UTF8 string must be interpreted with UTF16 encoding ('Ö' -> [195, 0], [19, 32]). So the ANSI '–' which is 150 has to be converted to the UTF16 '–' which is 8211. If you have this case too, you can use the following instead:

public static string Utf16ToUtf8(string utf16String)
{
    // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
    byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
    byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);

    // Return UTF8 bytes as ANSI string
    return Encoding.Default.GetString(utf8Bytes);
}

Or the Native-Method:

[DllImport("kernel32.dll")]
private static extern Int32 WideCharToMultiByte(UInt32 CodePage, UInt32 dwFlags, [MarshalAs(UnmanagedType.LPWStr)] String lpWideCharStr, Int32 cchWideChar, [Out, MarshalAs(UnmanagedType.LPStr)] StringBuilder lpMultiByteStr, Int32 cbMultiByte, IntPtr lpDefaultChar, IntPtr lpUsedDefaultChar);

public static string Utf16ToUtf8(string utf16String)
{
    Int32 iNewDataLen = WideCharToMultiByte(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf16String, utf16String.Length, null, 0, IntPtr.Zero, IntPtr.Zero);
    if (iNewDataLen > 1)
    {
        StringBuilder utf8String = new StringBuilder(iNewDataLen);
        WideCharToMultiByte(Convert.ToUInt32(Encoding.UTF8.CodePage), 0, utf16String, -1, utf8String, utf8String.Capacity, IntPtr.Zero, IntPtr.Zero);

        return utf8String.ToString();
    }
    else
    {
        return String.Empty;
    }
}

If you need it the other way around, see Utf8ToUtf16. Hope I could be of help.

Community
  • 1
  • 1
MEN
  • 547
  • 6
  • 7
  • There is no reason to call `Encoding.Unicode.GetBytes()` then `Encoding.Convert`, just call `Encoding.UTF8.GetBytes()` instead. – BlueRaja - Danny Pflughoeft May 27 '13 at 20:36
  • 1
    The first two methods don't even work for me, as @Thomas Levesque said, "A string in C# is always UTF-16, there is no way to "convert" it". So, since I'm calling a native function that requires a wide string, and I need to get a decoded string back, your native method is what works for me. – leetNightshade Jun 06 '13 at 16:44
  • @leetNightshade I am sorry to read that, it worked for me while calling a Delphi-DLL's Method, so I posted them here. I hope the native way does work for you and I could help you anyway. – MEN Jul 16 '13 at 08:15
  • Just to be sure: The string before converting will still be UTF-16, it just contains UTF-8 encoding data. You can't handle strings using the UTF-8 encoding, because .NET will always use the UTF-16 encoding to handle strings. – MEN Jul 16 '13 at 08:16
  • 1
    @MEN. Thank you very much. One question, every string I tested worked with the 2nd method (Encoding.Default). Will this be a problem when the user has an OS with a different default? –  Aug 13 '14 at 21:27
  • 1
    @Lara late but not never. Yes it should work because you can find in the documentation at "https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding" the line "For ANSI encodings, the best fit behavior is the default." – MEN Oct 27 '20 at 13:30
  • @MEN. Statement about default behavior is in the section about fallbacks. It says that you shouldn't use `Encoding(Int32, EncoderFallback, DecoderFallback)` constructor for ANSI encodings. `Default` property is NOT recommended for use [link](https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding). Don't confuse others – JL0PD Nov 20 '20 at 08:47
19

A string in C# is always UTF-16, there is no way to "convert" it. The encoding is irrelevant as long as you manipulate the string in memory, it only matters if you write the string to a stream (file, memory stream, network stream...).

If you want to write the string to a XML file, just specify the encoding when you create the XmlWriter

Thomas Levesque
  • 286,951
  • 70
  • 623
  • 758
  • I've finally found a solution...I've tried all that you advised me but that didn't work for me... This piece of code worked for me : using (TextWriter writer = new StreamWriter(filename) { xmlDoc.Save(writer); } – Celero Jun 01 '11 at 12:39
  • @Celero, it's not exactly what I suggested but it's equivalent... StreamWriter uses UTF-8 by default – Thomas Levesque Jun 01 '11 at 15:01
1
    private static string Utf16ToUtf8(string utf16String)
    {
        /**************************************************************
         * Every .NET string will store text with the UTF16 encoding, *
         * known as Encoding.Unicode. Other encodings may exist as    *
         * Byte-Array or incorrectly stored with the UTF16 encoding.  *
         *                                                            *
         * UTF8 = 1 bytes per char                                    *
         *    ["100" for the ansi 'd']                                *
         *    ["206" and "186" for the russian '?']                   *
         *                                                            *
         * UTF16 = 2 bytes per char                                   *
         *    ["100, 0" for the ansi 'd']                             *
         *    ["186, 3" for the russian '?']                          *
         *                                                            *
         * UTF8 inside UTF16                                          *
         *    ["100, 0" for the ansi 'd']                             *
         *    ["206, 0" and "186, 0" for the russian '?']             *
         *                                                            *
         * We can use the convert encoding function to convert an     *
         * UTF16 Byte-Array to an UTF8 Byte-Array. When we use UTF8   *
         * encoding to string method now, we will get a UTF16 string. *
         *                                                            *
         * So we imitate UTF16 by filling the second byte of a char   *
         * with a 0 byte (binary 0) while creating the string.        *
         **************************************************************/

        // Get UTF16 bytes and convert UTF16 bytes to UTF8 bytes
        byte[] utf16Bytes = Encoding.Unicode.GetBytes(utf16String);
        byte[] utf8Bytes = Encoding.Convert(Encoding.Unicode, Encoding.UTF8, utf16Bytes);
        char[] chars = (char[])Array.CreateInstance(typeof(char), utf8Bytes.Length);

        for (int i = 0; i < utf8Bytes.Length; i++)
        {
            chars[i] = BitConverter.ToChar(new byte[2] { utf8Bytes[i], 0 }, 0);
        }

        // Return UTF8
        return new String(chars);
    }

In the original post author concatenated strings. Every sting operation will result in string recreation in .Net. String is effectively a reference type. As a result, the function provided will be visibly slow. Don't do that. Use array of chars instead, write there directly and then convert result to string. In my case of processing 500 kb of text difference is almost 5 minutes.

  • actually, still there is an error. if utf16Bytes array actually contains 2 byte symbol, conversion will not work correctly, as we are just dropping the high byte – Pavel Kasiankou Oct 12 '16 at 14:25
0
class Program
{
    static void Main(string[] args)
    {
        String unicodeString =
        "This Unicode string contains two characters " +
        "with codes outside the traditional ASCII code range, " +
        "Pi (\u03a0) and Sigma (\u03a3).";

        Console.WriteLine("Original string:");
        Console.WriteLine(unicodeString);
        UnicodeEncoding unicodeEncoding = new UnicodeEncoding();
        byte[] utf16Bytes = unicodeEncoding.GetBytes(unicodeString);
        char[] chars = unicodeEncoding.GetChars(utf16Bytes, 2, utf16Bytes.Length - 2);
        string s = new string(chars);
        Console.WriteLine();
        Console.WriteLine("Char Array:");
        foreach (char c in chars) Console.Write(c);
        Console.WriteLine();
        Console.WriteLine();
        Console.WriteLine("String from Char Array:");
        Console.WriteLine(s);

        Console.ReadKey();
    }
}
0

Check the Jon Skeet answer to this other question: UTF-16 to UTF-8 conversion (for scripting in Windows)

It contains the source code that you need.

Hope it helps.

Community
  • 1
  • 1
Jonathan
  • 11,809
  • 5
  • 57
  • 91
0

does this example help ?

using System;
using System.IO;
using System.Text;

class Test
{
   public static void Main() 
   {        
    using (StreamWriter output = new StreamWriter("practice.txt")) 
    {
        // Create and write a string containing the symbol for Pi.
        string srcString = "Area = \u03A0r^2";

        // Convert the UTF-16 encoded source string to UTF-8 and ASCII.
        byte[] utf8String = Encoding.UTF8.GetBytes(srcString);
        byte[] asciiString = Encoding.ASCII.GetBytes(srcString);

        // Write the UTF-8 and ASCII encoded byte arrays. 
        output.WriteLine("UTF-8  Bytes: {0}", BitConverter.ToString(utf8String));
        output.WriteLine("ASCII  Bytes: {0}", BitConverter.ToString(asciiString));


        // Convert UTF-8 and ASCII encoded bytes back to UTF-16 encoded  
        // string and write.
        output.WriteLine("UTF-8  Text : {0}", Encoding.UTF8.GetString(utf8String));
        output.WriteLine("ASCII  Text : {0}", Encoding.ASCII.GetString(asciiString));

        Console.WriteLine(Encoding.UTF8.GetString(utf8String));
        Console.WriteLine(Encoding.ASCII.GetString(asciiString));
    }
}

}

yossi
  • 12,945
  • 28
  • 84
  • 110