54

In C#, I have a string that I'm obtaining from WebClient.DownloadString. I've tried setting client.Encoding to new UTF8Encoding(false), but that's made no difference - I still end up with a byte order mark for UTF-8 at the beginning of the result string. I need to remove this (to parse the resulting XML with LINQ), and want to do so in memory.

So I have a string that starts with \x00EF\x00BB\x00BF, and I want to remove that if it exists. Right now I'm using

if (xml.StartsWith(ByteOrderMarkUtf8))
{
    xml = xml.Remove(0, ByteOrderMarkUtf8.Length);
}

but that just feels wrong. I've tried all sorts of code with streams, GetBytes, and encodings, and nothing works. Can anyone provide the "right" algorithm to strip a BOM from a string?

Peter Mortensen
TrueWill

14 Answers

71

I recently ran into this after upgrading to .NET 4. Up to .NET 3.5, the simple answer is that

String.Trim()

removes the BOM.

However, in .NET 4 you need to change it slightly:

String.Trim(new char[]{'\uFEFF'});

That will also get rid of the byte order mark, though you may also want to remove the ZERO WIDTH SPACE (U+200B):

String.Trim(new char[]{'\uFEFF','\u200B'});

You could use the same approach to remove other unwanted characters.

Some further information, from the String.Trim Method documentation:

The .NET Framework 3.5 SP1 and earlier versions maintain an internal list of white-space characters that this method trims. Starting with the .NET Framework 4, the method trims all Unicode white-space characters (that is, characters that produce a true return value when they are passed to the Char.IsWhiteSpace method). Because of this change, the Trim method in the .NET Framework 3.5 SP1 and earlier versions removes two characters, ZERO WIDTH SPACE (U+200B) and ZERO WIDTH NO-BREAK SPACE (U+FEFF), that the Trim method in the .NET Framework 4 and later versions does not remove. In addition, the Trim method in the .NET Framework 3.5 SP1 and earlier versions does not trim three Unicode white-space characters: MONGOLIAN VOWEL SEPARATOR (U+180E), NARROW NO-BREAK SPACE (U+202F), and MEDIUM MATHEMATICAL SPACE (U+205F).
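As a minimal sketch of the explicit-character approach above: the snippet below uses TrimStart rather than Trim so that only leading characters are removed. The class and method names (TrimBomExample, StripLeadingBom) are hypothetical and not part of the original answer.

```csharp
using System;

public static class TrimBomExample
{
    // Hypothetical helper: removes only leading U+FEFF (BOM) and U+200B
    // (ZERO WIDTH SPACE) characters and leaves the rest of the string alone.
    public static string StripLeadingBom(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        return text.TrimStart('\uFEFF', '\u200B');
    }

    public static void Main()
    {
        string xml = "\uFEFF<root/>";
        string cleaned = StripLeadingBom(xml);

        Console.WriteLine(xml.Length);     // 8 (BOM + 7 visible characters)
        Console.WriteLine(cleaned.Length); // 7
    }
}
```

Passing the characters explicitly means the result should not depend on which characters a given framework version treats as white space.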

Peter Mortensen
PJUK
  • 1
    Sorry, your example does not appear to work. Try it with string "\x00EF\x00BB\x00BF" under .NET 4. – TrueWill Feb 04 '11 at 18:14
  • Didn't completely understand the question. I've had trouble with the standard BOM and didn't even recognise the \x00EF\x00BB\x00BF madness you had to deal with – PJUK Dec 14 '11 at 13:34
  • 3
    Isn't `'\uFEFF'` the BOM for UTF16, rather than UTF8? – Cocowalla May 18 '13 at 19:06
  • 1
    You know, you're right there, I've never had trouble with the UTF8 BOM (which is on reflection what the question asked - that is indeed the UTF8 one) the UTF16 BOM is what I was having trouble with at the time. – PJUK Jul 02 '13 at 12:04
  • 1
    @Cocowalla The corresponding _bytes_ are `FEFF` in big-endian UTF16, yes, but the preamble _character_ is the same in all encodings. – Nyerguds Jan 02 '17 at 09:20
56

I had some incorrect test data, which caused me some confusion. Based on How to avoid tripping over UTF-8 BOM when reading files I found that this worked:

private readonly string _byteOrderMarkUtf8 =
    Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());

public string GetXmlResponse(Uri resource)
{
    string xml;

    using (var client = new WebClient())
    {
        client.Encoding = Encoding.UTF8;
        xml = client.DownloadString(resource);
    }

    if (xml.StartsWith(_byteOrderMarkUtf8, StringComparison.Ordinal))
    {
        xml = xml.Remove(0, _byteOrderMarkUtf8.Length);
    }

    return xml;
}

Setting the client Encoding property correctly reduces the BOM to a single character. However, XDocument.Parse still will not read that string. This is the cleanest version I've come up with to date.
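To illustrate why the encoding matters, here is a small sketch that decodes the three UTF-8 preamble bytes in two ways. Using ISO-8859-1 as the "wrong" encoding is an assumption made for illustration (it reproduces the three characters shown in the question); the class name PreambleExample is hypothetical.

```csharp
using System;
using System.Text;

public static class PreambleExample
{
    // Decodes the UTF-8 preamble bytes (EF BB BF) with the given encoding.
    public static string DecodePreamble(Encoding encoding)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        return encoding.GetString(preamble);
    }

    public static void Main()
    {
        // Decoded as UTF-8, the three bytes become the single character U+FEFF.
        string asUtf8 = DecodePreamble(Encoding.UTF8);
        Console.WriteLine(asUtf8.Length);                   // 1
        Console.WriteLine(((int)asUtf8[0]).ToString("X4")); // FEFF

        // Decoded as ISO-8859-1 (assumed here as the mismatched encoding),
        // each byte becomes its own character: U+00EF, U+00BB, U+00BF.
        string asLatin1 = DecodePreamble(Encoding.GetEncoding("ISO-8859-1"));
        Console.WriteLine(asLatin1.Length);                 // 3
    }
}
```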

Community
TrueWill
  • 3
    Does not seem to work for me. Even "".StartsWith(_byteOrderMarkUtf8) returns true – pingo May 21 '15 at 08:16
  • 1
    @pingo Just tried your code in LINQPad 4 and it returned False. – TrueWill May 25 '15 at 15:02
  • 2
    Surprisingly, there's an implementation difference in the StartsWith method that produces different results on different operating systems. See http://stackoverflow.com/questions/19495318/startswith-change-in-windows-server-2012 – Rami A. Apr 14 '17 at 18:56
  • 1
    @RamiA. So I need to specify `StringComparison.Ordinal` for `StartsWith`? – TrueWill Apr 17 '17 at 00:20
  • 3
    @TrueWill, yes. Otherwise, the results are different when run on Windows 7 vs. Windows 8 or Windows Server 2012 for example. – Rami A. Apr 17 '17 at 04:17
  • 3
    This is the only approach that worked for me. I used string.Replace() to replace the BOM. Thanks – Daniel Leiszen Nov 25 '22 at 16:12
  • 1
    Good idea, @DanielLeiszen (-: `myString = myString.Replace(Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble()), "");` – Beauty Jan 19 '23 at 16:53
  • I have been concatenating strings for 20 years and in this case nothing special: assembling a dynamic insert statement. "Why now?" is my rhetorical question. – Rodney Apr 13 '23 at 14:10
33

This works as well

int index = xmlResponse.IndexOf('<');
if (index > 0)
{
    xmlResponse = xmlResponse.Substring(index, xmlResponse.Length - index);
}
Vivek Ayer
  • 1
    Looks simple to me, solved my problem and I think it will solve for other encodings too – Davi Fiamenghi Jan 12 '12 at 16:16
  • Hi Vivek, could you visit the Tridion StackExchange proposal when you have a minute please? area51.stackexchange.com/proposals/38335/tridion We believe the commitment score requires visits from time to time and so is not including you in "users with > 200 rep" figure. Thanks! – Rob Stevenson-Leggett Apr 11 '12 at 07:17
  • 3
    this code deserves to be put in a frame, WTF! typical from my consulting days... Please rather use @PJUK solution – knocte Nov 07 '12 at 18:02
  • I had an invisible crap character at the beginning of my string and end, so I had to do the code presented here as well as something similar to the end of the string: int closingBracket = result.LastIndexOf('>'); if (result.Length > closingBracket + 1) result = result.Remove(closingBracket + 1); – John Gilmer Mar 20 '19 at 18:30
27

A quick and simple method to remove it directly from a string:

private static string RemoveBom(string p)
{
     string BOMMarkUtf8 = Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble());
     if (p.StartsWith(BOMMarkUtf8, StringComparison.Ordinal))
         p = p.Remove(0, BOMMarkUtf8.Length);
     return p.Replace("\0", "");
}

How to use it:

string yourCleanString=RemoveBom(yourBOMString);

Note that StringComparison.Ordinal is important: depending on the culture the thread is running under, StartsWith can treat the BOM as an empty string, in which case it always returns true. Ordinal compares the strings using binary sort rules.
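A short sketch of the difference, assuming a hypothetical helper named StartsWithBom. The ordinal results are deterministic; the culture-sensitive call is printed without an expected value because its result can vary by platform and culture.

```csharp
using System;

public static class OrdinalStartsWithExample
{
    // Hypothetical helper: true only if the first char really is U+FEFF.
    public static bool StartsWithBom(string text)
    {
        return text != null && text.StartsWith("\uFEFF", StringComparison.Ordinal);
    }

    public static void Main()
    {
        Console.WriteLine(StartsWithBom("<root/>"));       // False
        Console.WriteLine(StartsWithBom("\uFEFF<root/>")); // True

        // Culture-sensitive overload: may report True even though the string
        // has no BOM, because U+FEFF can be ignored during the comparison.
        Console.WriteLine("<root/>".StartsWith("\uFEFF"));
    }
}
```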

ProgrammingLlama
Tiago Gouvêa
22

If the variable xml is of type string, you did something wrong already - in a character string, the BOM should not be represented as three separate characters, but as a single code point.

Instead of using DownloadString, use DownloadData, and parse byte arrays instead. The XML parser should recognize the BOM itself, and skip it (except for auto-detecting the document encoding as UTF-8).
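A minimal sketch of parsing from bytes, assuming the payload has already been obtained (for example via DownloadData). It goes through XmlReader.Create(Stream) and XDocument.Load(XmlReader); the class name, method name and sample XML are made up for illustration, and the byte array is built in memory so the example runs without a network call.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class ParseBytesExample
{
    // Hypothetical helper: lets the XML reader deal with the BOM itself.
    public static XDocument ParseXmlBytes(byte[] xmlBytes)
    {
        using (var stream = new MemoryStream(xmlBytes))
        using (var reader = XmlReader.Create(stream))
        {
            return XDocument.Load(reader);
        }
    }

    public static void Main()
    {
        // Stand-in for the downloaded bytes: UTF-8 BOM followed by the XML.
        byte[] preamble = Encoding.UTF8.GetPreamble();
        byte[] body = Encoding.UTF8.GetBytes(
            "<?xml version=\"1.0\" encoding=\"utf-8\"?><root><item>1</item></root>");

        byte[] xmlBytes = new byte[preamble.Length + body.Length];
        Buffer.BlockCopy(preamble, 0, xmlBytes, 0, preamble.Length);
        Buffer.BlockCopy(body, 0, xmlBytes, preamble.Length, body.Length);

        XDocument doc = ParseXmlBytes(xmlBytes);
        Console.WriteLine(doc.Root.Name); // root
    }
}
```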

Peter Mortensen
Martin v. Löwis
  • 13
    XDocument.Parse does not have an overload that accepts a byte array. I find the statement "you did something wrong" condescending. I would have expected DownloadString to detect the BOM and select the correct encoding. – TrueWill Aug 23 '09 at 16:42
  • 4
    I think you can get the XDocument also through .Load, passing an XmlReader, which you can get by passing a Stream, for which you can use a MemoryStream. I didn't mean to be condescending; I only tried to point out that the intermediate result that you got is seemingly incorrect, so that the real problem is not that you have to strip those characters, but that they are present in the first place. Perhaps it is the case that there is a flaw in DownloadString, in which case you shouldn't be using it. Perhaps the flaw is in the web server reporting the wrong charset. – Martin v. Löwis Aug 23 '09 at 21:38
  • OK, thanks. I did find I didn't have the client Encoding set correctly for DownloadString, which gave me a single code point (as you mentioned). It's somewhat moot at this point, as the company providing the "REST" service decided to remove the redundant (for XML in utf-8) BOM. – TrueWill Aug 24 '09 at 18:14
  • 1
    good call. Using XDocument.Load worked out quite well for me. It's not necessary to use the XmlReader, though, as XDocument.Load takes a stream for an argument. – Steven Oxley Oct 27 '10 at 22:10
12

I had a very similar problem (I needed to parse an XML document represented as a byte array that had a byte order mark at the beginning of it). I used one of Martin's comments on his answer to come to a solution. I took the byte array I had (instead of converting it to a string) and created a MemoryStream object with it. Then I passed it to XDocument.Load, which worked like a charm. For example, let's say that xmlBytes contains your XML in UTF-8 encoding with a byte order mark at the beginning of it. Then, this would be the code to solve the problem:

var stream = new MemoryStream(xmlBytes);
var document = XDocument.Load(stream);

It's that simple.

If starting out with a string, it should still be easy to do (assume xml is your string containing the XML with the byte order mark):

var bytes = Encoding.UTF8.GetBytes(xml);
var stream = new MemoryStream(bytes);
var document = XDocument.Load(stream);
Peter Mortensen
Steven Oxley
  • 2
    This worked great for me but I had to add an intermediary StreamReader – ScottB Nov 23 '10 at 22:14
  • ie. var doc = XDocument.Load(new StreamReader(new MemoryStream(batchfile))); – ScottB Nov 23 '10 at 22:15
  • Me too, Steven's code doesn't compile. There is no overload of XDocument.Load() that takes a Stream. – Chris Wenham Jul 15 '11 at 17:33
  • 2
    Here is the documentation for the `XDocument.Load(Stream)` overload: http://msdn.microsoft.com/en-us/library/cc838349.aspx. I guess it's specific to .NET 4, so you must be using .NET 3.5. In that case you would have to use a different overload. – Steven Oxley Jul 19 '11 at 15:29
9

I wrote the following post after coming across this issue.

Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve.
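The linked post is not reproduced here, so the following is only a sketch of the general idea. It assumes the StreamReader(Stream, Encoding, bool) constructor with detectEncodingFromByteOrderMarks set to true; which constructor the post actually used is not stated. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamReaderBomExample
{
    // Hypothetical helper: decodes bytes to text, letting StreamReader
    // consume the byte order mark instead of returning it as a character.
    public static string DecodeWithoutBom(byte[] data)
    {
        using (var stream = new MemoryStream(data))
        using (var reader = new StreamReader(stream, Encoding.UTF8, true))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        // UTF-8 BOM (EF BB BF) followed by "<root/>".
        byte[] data = { 0xEF, 0xBB, 0xBF, 0x3C, 0x72, 0x6F, 0x6F, 0x74, 0x2F, 0x3E };

        string text = DecodeWithoutBom(data);
        Console.WriteLine(text);        // <root/>
        Console.WriteLine(text.Length); // 7
    }
}
```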

Andrew Thompson
5

Pass the byte buffer (via DownloadData) to string Encoding.UTF8.GetString(byte[]) to get the string rather than download the buffer as a string. You probably have more problems with your current method than just trimming the byte order mark. Unless you're properly decoding it as I suggest here, Unicode characters will probably be misinterpreted, resulting in a corrupted string.

Martin's answer is better, since it avoids allocating an entire string for XML that still needs to be parsed anyway. The answer I gave best applies to general strings that don't need to be parsed as XML.

Peter Mortensen
Andrew Arnott
  • 1
    Thank you for your response; unfortunately this did not work. I used DownloadData and that worked; however, Encoding.UTF8.GetString(byte[]) did not strip the BOM. I tried variants with new UTF8Encoding(false) and (true) without success. Please note that this is UTF-8 data - encoding="utf-8" is specified in the XML header, and it parses correctly once the BOM is removed. – TrueWill Aug 23 '09 at 16:47
  • Interesting. I was going to mark this down because I'd been using UTF8Encoding.ASCII.GetString(bytes) which leaves the BOM in but Encoding.UTF8.GetString(bytes) removes it. Upvoted instead – Carl Onager Oct 22 '12 at 14:04
  • In my tests, both `Encoding.UTF8.GetString(byte[] s)` and `new UTF8Encoding(encoderShouldEmitUTF8Identifier: false).GetString(byte[] s)` do not trim BOM. – Yan F. Dec 02 '19 at 02:56
5

It's of course best if you can strip it out while still on the byte array level to avoid unwanted substrings / allocs. But if you already have a string, this is perhaps the easiest and most performant way to handle this.

Usage:

            string feed = ""; // input
            bool hadBOM = FixBOMIfNeeded(ref feed);

            var xElem = XElement.Parse(feed); // now does not fail

    /// <summary>
    /// The UTF-8 BOM bytes `[239, 187, 191]` decode to a single C# char, U+FEFF (65279).
    /// You can obtain or verify it with: Encoding.UTF8.GetString(Encoding.UTF8.GetPreamble())[0];
    /// a constant avoids doing that at run time.
    /// </summary>
    public const char BOMChar = (char)65279;

    public static bool FixBOMIfNeeded(ref string str)
    {
        if (string.IsNullOrEmpty(str))
            return false;

        bool hasBom = str[0] == BOMChar;
        if (hasBom)
            str = str.Substring(1);

        return hasBom;
    }
Nicholas Petersen
3

I ran into this when I had a Base64 encoded file to transform into the string. While I could have saved it to a file and then read it correctly, here's the best solution I could think of to get from the byte[] of the file to the string (based lightly on TrueWill's answer):

public static string GetUTF8String(byte[] data)
{
    byte[] utf8Preamble = Encoding.UTF8.GetPreamble();
    if (data.StartsWith(utf8Preamble))
    {
        return Encoding.UTF8.GetString(data, utf8Preamble.Length, data.Length - utf8Preamble.Length);
    }
    else
    {
        return Encoding.UTF8.GetString(data);
    }
}

Where StartsWith(byte[]) is the logical extension:

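public static bool StartsWith(this byte[] thisArray, byte[] otherArray)
{
   // Handle invalid/unexpected input: a null array, or an array shorter
   // than the prefix, cannot start with that prefix.
   if (thisArray == null || otherArray == null || thisArray.Length < otherArray.Length)
   {
       return false;
   }

   // Compare byte by byte over the length of the prefix
   for (int i = 0; i < otherArray.Length; ++i)
   {
       if (thisArray[i] != otherArray[i])
       {
           return false;
       }
   }

   return true;
}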
ProgrammingLlama
Timothy
  • I don't see anything restricting the concept here to UTF-8. Since `GetPreamble()` belongs to `Encoding`, it should be possible to genericize to take in the Encoding as a parameter. – Timothy Mar 20 '15 at 21:43
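One possible shape for the generalization suggested in the comment above, as a sketch only: the method takes the Encoding as a parameter and skips that encoding's own preamble if the data starts with it. The names GenericPreambleExample and GetStringWithoutPreamble are hypothetical.

```csharp
using System;
using System.Text;

public static class GenericPreambleExample
{
    // Hypothetical helper: decodes data with the given encoding, skipping the
    // encoding's preamble (BOM) when the data begins with it.
    public static string GetStringWithoutPreamble(byte[] data, Encoding encoding)
    {
        byte[] preamble = encoding.GetPreamble();
        bool hasPreamble = preamble.Length > 0 && data.Length >= preamble.Length;

        for (int i = 0; hasPreamble && i < preamble.Length; ++i)
        {
            if (data[i] != preamble[i])
            {
                hasPreamble = false;
            }
        }

        int offset = hasPreamble ? preamble.Length : 0;
        return encoding.GetString(data, offset, data.Length - offset);
    }

    public static void Main()
    {
        // UTF-16 little-endian BOM (FF FE) followed by "A" (41 00).
        byte[] utf16 = { 0xFF, 0xFE, 0x41, 0x00 };
        Console.WriteLine(GetStringWithoutPreamble(utf16, Encoding.Unicode)); // A

        // UTF-8 BOM (EF BB BF) followed by "A".
        byte[] utf8 = { 0xEF, 0xBB, 0xBF, 0x41 };
        Console.WriteLine(GetStringWithoutPreamble(utf8, Encoding.UTF8));     // A
    }
}
```

The caller still has to know which encoding the data is in; this sketch does not try to detect it from the bytes.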
2
StreamReader sr = new StreamReader(strFile, true);
XmlDocument xdoc = new XmlDocument();
xdoc.Load(sr);
siva.k
lucasjam
1

Yet another generic variation to get rid of the UTF-8 BOM preamble:

var preamble = Encoding.UTF8.GetPreamble();
if (!functionBytes.Take(preamble.Length).SequenceEqual(preamble))
    preamble = Array.Empty<Byte>();
return Encoding.UTF8.GetString(functionBytes, preamble.Length, functionBytes.Length - preamble.Length);
Vinicius
0

Use a regex replace to filter out any characters other than the alphanumeric characters and spaces contained in a normal certificate thumbprint value:

certficateThumbprint = Regex.Replace(certficateThumbprint, @"[^a-zA-Z0-9\-\s*]", "");

And there you go. Voila!! It worked for me.

Peter Mortensen
-1

I solved the issue with the following code

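using System.IO;
using System.Xml.Linq;

void method()
{
    // GetXmlBytes() stands in for however you obtain the raw XML bytes (BOM included)
    byte[] bytes = GetXmlBytes();
    XDocument doc;
    using (var stream = new MemoryStream(bytes))
    {
        // Loading from the stream lets the XML reader handle the BOM
        doc = XDocument.Load(stream);
    }
}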
Oleg Polezky