9

Is there any way to determine a byte array's encoding in C#?

I have a string, such as "Lorem ipsum áéíóú ñÑç", and I convert it to byte arrays using several encodings.

I would like a single method that detects the encoding of a byte array so that I can get the original string value back.

Another scenario: I may have a database column that stores BLOBs (byte arrays). One application stores a string converted to a byte array using UTF-8, while another application may convert its strings to byte arrays using Unicode (UTF-16) encoding.

The database column would then hold byte arrays in several encodings, so I need a way to find the encoding of each byte array.

Tests:

    string DataXmlForSupport = "<support><machinename></machinename><comments>Este es el log 1 áéíóú</comments></support>";
    string DataXmlForSupport2 = "Lorem ipsum áéíóú ñÑç";

    [TestMethod]
    public void Encoding_byte_array_string()
    {
        // Encode as UTF-16 (Unicode).
        var uencoding = new System.Text.UnicodeEncoding();
        byte[] data = uencoding.GetBytes(DataXmlForSupport);

        // Decoding with the same encoding round-trips the string.
        var dataXml = Encoding.Unicode.GetString(data);
        Assert.AreEqual(DataXmlForSupport, dataXml, "Expected a Unicode result");

        // Decoding with the wrong encoding does not.
        dataXml = Encoding.UTF8.GetString(data);
        Assert.AreNotEqual(DataXmlForSupport, dataXml, "Did NOT expect a UTF8 result");

        // Same checks, this time encoding as UTF-8.
        var utf8 = new System.Text.UTF8Encoding();
        data = utf8.GetBytes(DataXmlForSupport2);

        dataXml = Encoding.UTF8.GetString(data);
        Assert.AreEqual(DataXmlForSupport2, dataXml, "Expected a UTF8 result");

        dataXml = Encoding.Unicode.GetString(data);
        Assert.AreNotEqual(DataXmlForSupport2, dataXml, "Did NOT expect a Unicode result");
    }
Kiquenet
  • You should fix your database to only have one encoding, or store the encoding name in a separate column. It is not possible to reliably detect encodings. – SLaks Oct 22 '13 at 13:47
  • Typically it's your job to associate the encoding with the data. For example in most XML/HTML files one of the first things you'll see is an attribute that describes the encoding. If the encoding is not supplied then based on the spec there is usually a default encoding which is presumed. – Trevor Elliott Oct 22 '13 at 13:48
  • possible duplicate of [How to detect the character encoding of a text file?](http://stackoverflow.com/questions/4520184/how-to-detect-the-character-encoding-of-a-text-file) – Jim Dagg Oct 22 '13 at 14:09
  • @JimDagg A text file is not the same as a string; I think there are a few differences. Anyway, maybe the two questions can share knowledge. – Kiquenet Oct 23 '13 at 06:16
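The suggestion in the comments above (store the encoding name next to the data) could look roughly like the following. This is only a minimal sketch: `StoredText` and its members are hypothetical names, and it assumes the two properties map to two database columns.

```csharp
using System;
using System.Text;

// Hypothetical record pairing the raw bytes with the name of the encoding
// that produced them; in a database these would be two separate columns.
public sealed class StoredText
{
    public byte[] Data { get; }
    public string EncodingName { get; }

    public StoredText(byte[] data, string encodingName)
    {
        Data = data;
        EncodingName = encodingName;
    }

    // Encode the text and remember which encoding was used.
    public static StoredText From(string text, Encoding encoding)
    {
        return new StoredText(encoding.GetBytes(text), encoding.WebName);
    }

    // Decode with the recorded encoding name instead of guessing.
    public string Decode()
    {
        return Encoding.GetEncoding(EncodingName).GetString(Data);
    }
}
```

With this in place, `StoredText.From(text, Encoding.Unicode).Decode()` gives back `text`, and the same holds for `Encoding.UTF8`, because the reader never has to infer anything from the bytes.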

3 Answers

4

In short, no. Please see [How to detect the character encoding of a text file?](http://stackoverflow.com/questions/4520184/how-to-detect-the-character-encoding-of-a-text-file) for a detailed answer on various encodings and why they can't be automatically determined.
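To illustrate why detection is ambiguous, here is a small sketch (the class and method names are made up for this example). It decodes the same UTF-8 bytes twice, once as UTF-8 and once as ISO-8859-1. Both calls succeed, because ISO-8859-1 maps every byte value to a character, so nothing in the bytes themselves says which reading was intended.

```csharp
using System;
using System.Text;

public static class AmbiguityDemo
{
    // UTF-8 bytes for the sample string from the question.
    public static byte[] SampleBytes()
    {
        return Encoding.UTF8.GetBytes("Lorem ipsum áéíóú ñÑç");
    }

    // Correct reading: the bytes really are UTF-8.
    public static string AsUtf8(byte[] data)
    {
        return Encoding.UTF8.GetString(data);
    }

    // Wrong reading: no error is raised, the accented characters
    // simply come out as two garbage characters each.
    public static string AsLatin1(byte[] data)
    {
        return Encoding.GetEncoding("iso-8859-1").GetString(data);
    }
}
```

Printing `AsUtf8(SampleBytes())` and `AsLatin1(SampleBytes())` side by side shows two different strings produced from identical input without any exception.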

Your best solution is to convert the string from its original encoding to UTF-8 and convert that to a byte array. Then you'll know your byte array's encoding...

David Arno
  • If I convert the string to UTF-8, the byte array's encoding is UTF-8. Anyway, what is the best way to safely convert a string to UTF-8? – Kiquenet Oct 23 '13 at 05:32
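On the question of converting safely: one possible approach, sketched below under assumptions, is to use a strict `UTF8Encoding` instance that throws on invalid data instead of silently substituting U+FFFD. The `Utf8Text` class and its method names are hypothetical.

```csharp
using System;
using System.Text;

public static class Utf8Text
{
    // Strict UTF-8: no BOM is emitted, and invalid data throws instead of
    // being silently replaced with U+FFFD.
    private static readonly UTF8Encoding Strict =
        new UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
                         throwOnInvalidBytes: true);

    // .NET strings are UTF-16 in memory; this produces their UTF-8 bytes.
    public static byte[] ToBytes(string text)
    {
        return Strict.GetBytes(text);
    }

    // Returns false when the bytes are not well-formed UTF-8.
    public static bool TryDecode(byte[] data, out string text)
    {
        try
        {
            text = Strict.GetString(data);
            return true;
        }
        catch (DecoderFallbackException)
        {
            text = null;
            return false;
        }
    }
}
```

Note that `TryDecode` returning true only shows the bytes are well-formed UTF-8. It does not prove they were written as UTF-8, so this is a validation step rather than a detector.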
3

I realize I'm late to the party here, but I just had a need to do this very thing and found a good way to do it:

    using System.IO;
    using System.Text;

    byte[] data = new byte[0]; // Placeholder: populate this however you see fit with your data
    string text;
    Encoding enc;
    using (StreamReader reader = new StreamReader(new MemoryStream(data),
                                                  detectEncodingFromByteOrderMarks: true))
    {
        text = reader.ReadToEnd();
        // Set from the byte order mark if one is present; otherwise stays at the UTF-8 default.
        enc = reader.CurrentEncoding;
    }
Greg Loomis
  • This will only work if the data contains a [BOM](https://en.wikipedia.org/wiki/Byte_order_mark) at the beginning, which is not always the case. Otherwise, it will pretty much just default to assuming it is UTF-8. – Demonslay335 Dec 05 '20 at 16:44
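To make the BOM limitation concrete, here is a rough sketch of what BOM-based detection amounts to, written by hand rather than through `StreamReader`. `BomSniffer` and `DetectFromBom` are hypothetical names, and the sketch is not a description of how `StreamReader` is implemented. It returns null whenever the data carries no BOM, which is exactly the case where nothing can be concluded.

```csharp
using System;
using System.Text;

public static class BomSniffer
{
    // UTF-32 LE (FF FE 00 00) is listed before UTF-16 LE (FF FE)
    // because the UTF-16 LE BOM is a prefix of the UTF-32 LE one.
    private static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(bigEndian: false, byteOrderMark: true),
        new UTF32Encoding(bigEndian: true, byteOrderMark: true),
        Encoding.UTF8,
        Encoding.Unicode,
        Encoding.BigEndianUnicode,
    };

    // Returns the encoding whose BOM prefixes the data, or null if none does.
    public static Encoding DetectFromBom(byte[] data)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] bom = candidate.GetPreamble();
            if (bom.Length > 0 && StartsWith(data, bom))
            {
                return candidate;
            }
        }
        return null;
    }

    // True when data begins with every byte of prefix, in order.
    private static bool StartsWith(byte[] data, byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }
}
```

Bytes produced by a plain `Encoding.UTF8.GetBytes(text)` or `Encoding.Unicode.GetBytes(text)` call have no preamble, so `DetectFromBom` returns null for them. The BOM only exists if the writer explicitly added it.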
-1

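To complement the other answers, you could try:

    // byte_array is the input byte[] you already have.
    // Note: BitConverter.ToString returns a hex dump such as "4C-6F-72", not decoded text.
    string str = BitConverter.ToString(byte_array);
    byte[] utf8Bytes = Encoding.UTF8.GetBytes(str);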
Jesper Lundin
Lucas Prestes