Encoding.UTF8.GetString doesn't take into account the Preamble/BOM

Question

In .NET, I'm trying to use Encoding.UTF8.GetString method, which takes a byte array and converts it to a string.

It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.

I know I can use a TextReader to digest the BOM as needed, but I thought that the GetString method should be some kind of a macro that makes our code shorter.

Am I missing something? Is this like so intentionally?

Here's a reproduction code:

static void Main(string[] args)
{
    string s1 = "abc";
    byte[] abcWithBom;
    using (var ms = new MemoryStream())
    using (var sw = new StreamWriter(ms, new UTF8Encoding(true)))
    {
        sw.Write(s1);
        sw.Flush();
        abcWithBom = ms.ToArray();
        Console.WriteLine(FormatArray(abcWithBom)); // ef, bb, bf, 61, 62, 63
    }

    byte[] abcWithoutBom;
    using (var ms = new MemoryStream())
    using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
    {
        sw.Write(s1);
        sw.Flush();
        abcWithoutBom = ms.ToArray();
        Console.WriteLine(FormatArray(abcWithoutBom)); // 61, 62, 63
    }

    var restore1 = Encoding.UTF8.GetString(abcWithoutBom);
    Console.WriteLine(restore1.Length); // 3
    Console.WriteLine(restore1); // abc

    var restore2 = Encoding.UTF8.GetString(abcWithBom);
    Console.WriteLine(restore2.Length); // 4 (!)
    Console.WriteLine(restore2); // ?abc
}

private static string FormatArray(byte[] bytes1)
{
    return string.Join(", ", from b in bytes1 select b.ToString("x"));
}

score 32 · Accepted Answer · answered Jul 28 '12 at 13:47

It looks like this method ignores the BOM (Byte Order Mark), which might be a part of a legitimate binary representation of a UTF8 string, and takes it as a character.

It doesn't look like it "ignores" it at all - it faithfully converts it to the BOM character. That's what it is, after all.

If you want to make your code ignore the BOM in any string it converts, that's up to you to do... or use StreamReader.

Note that if you either use Encoding.GetBytes followed by Encoding.GetString or use StreamWriter followed by StreamReader, both forms will either produce then swallow or not produce the BOM. It's only when you mix using a StreamWriter (which uses Encoding.GetPreamble) with a direct Encoding.GetString call that you end up with the "extra" character.

@RonKlein Additionally, you could say `restore2 = restore2.TrimStart('\uFEFF')` to remove leading BOM characters. I have also at one time wondered why `(new UTF8Encoding(true)).GetBytes("abc")` and `(new UTF8Encoding(false)).GetBytes("abc")` produce the same output, but as you probably know by now, `GetBytes` does not assume you're in the beginning of a file, so it never uses `GetPreamble`. You have to `GetPreamble` explicitly, or skip the preamble explicitly, if you use `GetBytes`, or `GetString`. — Jeppe Stig Nielsen, Jul 29 '12 at 12:45

score 10 · Answer 2 · edited Jun 13 '17 at 15:08

10

Based on the answer by Jon Skeet (thanks!), this is how I just did it:

var memoryStream = new MemoryStream(byteArray);
var s = new StreamReader(memoryStream).ReadToEnd();

Note that this will probably only work reliably if there is a BOM in the byte array you are reading from. If not, you might want to look into another StreamReader constructor overload which takes an Encoding parameter so you can tell it what the byte array contains.

edited Jun 13 '17 at 15:08

drzaus

24,171
16
142
201

answered Nov 20 '13 at 15:36

Per Lundberg

3,837
1
36
46

1

I think you might want [this constructor overload](https://msdn.microsoft.com/en-us/library/ms143457(v=vs.110).aspx) instead which lets you specify whether it should look for a BOM to figure out encoding. – drzaus Jun 13 '17 at 15:07

score 7 · Answer 3 · answered Dec 12 '18 at 10:32

for those who do not want to use streams I found a quite simple solution using Linq:

public static string GetStringExcludeBOMPreamble(this Encoding encoding, byte[] bytes)
{
    var preamble = encoding.GetPreamble();
    if (preamble?.Length > 0 && bytes.Length >= preamble.Length && bytes.Take(preamble.Length).SequenceEqual(preamble))
    {
        return encoding.GetString(bytes, preamble.Length, bytes.Length - preamble.Length);
    }
    else
    {
        return encoding.GetString(bytes);
    }
}

score 0 · Answer 4 · edited Apr 02 '21 at 18:27

I know I am kind of late to the party but here's the code I am using (feel free to adapt to C#) if you need:

Public Function Serialize(Of YourXMLClass)(ByVal obj As YourXMLClass,
                                                      Optional ByVal omitXMLDeclaration As Boolean = True,
                                                      Optional ByVal omitXMLNamespace As Boolean = True) As String

    Dim serializer As New XmlSerializer(obj.GetType)
    Using memStream As New MemoryStream()
        Dim settings As New XmlWriterSettings() With {
                    .Encoding = Encoding.UTF8,
                    .Indent = True,
                    .omitXMLDeclaration = omitXMLDeclaration}

        Using writer As XmlWriter = XmlWriter.Create(memStream, settings)
            Dim xns As New XmlSerializerNamespaces
            If (omitXMLNamespace) Then xns.Add("", "")
            serializer.Serialize(writer, obj, xns)
        End Using

        Return Encoding.UTF8.GetString(memStream.ToArray())
    End Using
End Function

Public Function Deserialize(Of YourXMLClass)(ByVal obj As YourXMLClass, ByVal xml As String) As YourXMLClass
    Dim result As YourXMLClass
    Dim serializer As New XmlSerializer(GetType(YourXMLClass))

    Using memStream As New MemoryStream()
        Dim bytes As Byte() = Encoding.UTF8.GetBytes(xml.ToArray)
        memStream.Write(bytes, 0, bytes.Count)
        memStream.Seek(0, SeekOrigin.Begin)

        Using reader As XmlReader = XmlReader.Create(memStream)
            result = DirectCast(serializer.Deserialize(reader), YourXMLClass)
        End Using

    End Using
    Return result
End Function

Encoding.UTF8.GetString doesn't take into account the Preamble/BOM

4 Answers4

Linked

Related