Remove Byte Order Mark from a File.ReadAllBytes (byte[])

Question

I have an HttpHandler that reads in a set of CSS files, combines them, and then GZips them. However, some of the CSS files contain a Byte Order Mark (due to a bug in TFS 2005 auto merge), and in Firefox the BOM is read as part of the actual content, so it's screwing up my class names etc. How can I strip out the BOM characters? Is there an easy way to do this without manually going through the byte array looking for "ï»¿"?

Is the BOM appearing in the actual text itself, or just at the very start? I'd be surprised to see it anywhere other than at the start of the data - in which case simply ignoring the first 3 bytes (assuming UTF-8) should do the trick. — Jon Skeet, Nov 13 '08 at 20:14
FWIW, you could open the files in [Notepad++](http://notepad-plus.sourceforge.net/uk/site.htm) and save them without the Byte Order Mark. It's what I had to do in [this question](http://stackoverflow.com/questions/291455/xml-data-at-root-level-is-invalid). — George Stocker, Nov 16 '08 at 22:56
I wrote the [following post](http://andrewmatthewthompson.blogspot.com/2011/02/byte-order-mark-found-using-net.html) after coming across this issue. Essentially instead of reading in the raw bytes of the file's contents using the BinaryReader class, I use the StreamReader class with a specific constructor which automatically removes the byte order mark character from the textual data I am trying to retrieve. — Andrew Thompson, Feb 20 '11 at 21:06

score 8 · Answer 1 · edited May 23 '17 at 11:48

Expanding on Jon's comment with a sample.

// Assumes the file starts with a 3-byte UTF-8 BOM; Skip needs "using System.Linq;".
var name = GetFileName();
var bytes = System.IO.File.ReadAllBytes(name);
System.IO.File.WriteAllBytes(name, bytes.Skip(3).ToArray());

answered Nov 14 '08 at 02:54

JaredPar

Quoting the OP: *However, some of the CSS files contain a Byte Order Mark*. Only **some** of them, so note that the code above doesn't check whether there's a BOM before it skips it... – Pure.Krome Aug 10 '14 at 11:24
But UTF-32 has a 4-byte BOM. In this case you have to skip 4 – Legends Feb 12 '23 at 13:01
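
Picking up on the two comments above, here is a minimal sketch of a checked strip that works directly on the `byte[]` from `File.ReadAllBytes`. The `BomStripper` class and its method names are made up for illustration. It only looks at the very start of the data and compares against the standard BOM signatures for UTF-8, UTF-16 and UTF-32. One caveat: `FF FE 00 00` is treated as UTF-32 LE, although it could in theory also be UTF-16 LE text that begins with a NUL character.

```csharp
using System;

public static class BomStripper
{
    // Known BOM signatures, longest first so that UTF-32 LE (FF FE 00 00)
    // is tested before UTF-16 LE (FF FE), which is a prefix of it.
    private static readonly byte[][] Boms =
    {
        new byte[] { 0x00, 0x00, 0xFE, 0xFF }, // UTF-32 BE
        new byte[] { 0xFF, 0xFE, 0x00, 0x00 }, // UTF-32 LE
        new byte[] { 0xEF, 0xBB, 0xBF },       // UTF-8
        new byte[] { 0xFE, 0xFF },             // UTF-16 BE
        new byte[] { 0xFF, 0xFE },             // UTF-16 LE
    };

    // Returns the length of the BOM at the start of data, or 0 if there is none.
    public static int GetBomLength(byte[] data)
    {
        foreach (var bom in Boms)
        {
            if (data.Length < bom.Length) continue;
            bool match = true;
            for (int i = 0; i < bom.Length; i++)
            {
                if (data[i] != bom[i]) { match = false; break; }
            }
            if (match) return bom.Length;
        }
        return 0;
    }

    // Returns data without its leading BOM; the same array is returned
    // unchanged when no BOM is present.
    public static byte[] StripBom(byte[] data)
    {
        int skip = GetBomLength(data);
        if (skip == 0) return data;
        var result = new byte[data.Length - skip];
        Array.Copy(data, skip, result, 0, result.Length);
        return result;
    }
}
```

In the handler from the question, something like `BomStripper.StripBom(File.ReadAllBytes(path))` could be applied to each CSS file before the contents are combined, so files without a BOM pass through untouched.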

score 6 · Answer 2 · answered May 19 '10 at 08:23

Expanding JaredPar's sample to recurse over sub-directories:

using System.Linq;
using System.IO;
namespace BomRemover
{
    /// <summary>
    /// Remove UTF-8 BOM (EF BB BF) of all *.php files in current & sub-directories.
    /// </summary>
    class Program
    {
        private static void removeBoms(string filePattern, string directory)
        {
            foreach (string filename in Directory.GetFiles(directory, filePattern))
            {
                var bytes = System.IO.File.ReadAllBytes(filename);
                if(bytes.Length > 2 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
                {
                    System.IO.File.WriteAllBytes(filename, bytes.Skip(3).ToArray()); 
                }
            }
            foreach (string subDirectory in Directory.GetDirectories(directory))
            {
                removeBoms(filePattern, subDirectory);
            }
        }
        static void Main(string[] args)
        {
            string filePattern = "*.php";
            string startDirectory = Directory.GetCurrentDirectory();
            removeBoms(filePattern, startDirectory);            
        }       
    }
}

I needed this piece of C# code after discovering that a UTF-8 BOM corrupts the file when you do a basic PHP file download.
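
As a possible variation on the program above, the manual recursion could be replaced with `Directory.EnumerateFiles` and `SearchOption.AllDirectories`. This is only a sketch: the `RecursiveBomRemover` class and its return value (a count of rewritten files) are made up for illustration, and a directory the process cannot read would make the enumeration throw.

```csharp
using System;
using System.IO;

public static class RecursiveBomRemover
{
    // Strips the UTF-8 BOM (EF BB BF) from every matching file under
    // directory, including sub-directories. Returns the number of files rewritten.
    public static int RemoveUtf8Boms(string directory, string filePattern)
    {
        int rewritten = 0;
        foreach (string filename in Directory.EnumerateFiles(
                     directory, filePattern, SearchOption.AllDirectories))
        {
            byte[] bytes = File.ReadAllBytes(filename);
            if (bytes.Length >= 3 && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF)
            {
                // Copy everything after the 3-byte BOM and overwrite the file.
                var stripped = new byte[bytes.Length - 3];
                Array.Copy(bytes, 3, stripped, 0, stripped.Length);
                File.WriteAllBytes(filename, stripped);
                rewritten++;
            }
        }
        return rewritten;
    }
}
```

A call such as `RecursiveBomRemover.RemoveUtf8Boms(Directory.GetCurrentDirectory(), "*.php")` would cover the same files as the `Main` method above.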

score 3 · Answer 3 · answered Jul 16 '09 at 09:50

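// ReadAllText detects the BOM and leaves it out of the returned string.
var text = File.ReadAllText(args.SourceFileName);
// UTF8Encoding(false) writes UTF-8 without emitting a BOM; "using" closes the writer even on error.
using (var streamWriter = new StreamWriter(args.DestFileName, args.Append, new UTF8Encoding(false)))
{
    streamWriter.Write(text);
}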

Looking at this code, ideally it should work. But, I am surprised that it is saving file in ANSI format. – VJOY Mar 13 '10 at 07:42
`new UTF8Encoding(false)` the parameter indicates whether to add the BOM or not. – Guy Lowe Apr 04 '14 at 01:18
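
To illustrate the point about the constructor parameter, here is a small sketch that writes the same text through a `StreamWriter` with two different encodings and returns the raw bytes. The `PreambleDemo` class is hypothetical and exists only for this comparison. It may also explain the "ANSI" observation above: a BOM-less UTF-8 file that contains only ASCII characters is byte-for-byte identical to an ANSI file, so an editor has nothing to tell them apart.

```csharp
using System.IO;
using System.Text;

public static class PreambleDemo
{
    // Writes text through a StreamWriter into memory and returns the bytes
    // that would have ended up in a file.
    public static byte[] Encode(string text, Encoding encoding)
    {
        using (var memory = new MemoryStream())
        {
            using (var writer = new StreamWriter(memory, encoding))
            {
                writer.Write(text);
            }
            // ToArray still works after the writer has closed the stream.
            return memory.ToArray();
        }
    }
}
```

With this helper, `PreambleDemo.Encode("a", new UTF8Encoding(false))` yields the single byte `61`, while `PreambleDemo.Encode("a", Encoding.UTF8)` yields `EF BB BF 61`, because `Encoding.UTF8` supplies the BOM as its preamble.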

score 1 · Answer 4 · answered Nov 14 '08 at 08:32

Another way, assuming the files are UTF-8 and their content can be written as plain ASCII.

File.WriteAllText(filename, File.ReadAllText(filename, Encoding.UTF8), Encoding.ASCII);

Tim Bailey
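
One thing to keep in mind with the ASCII approach: `Encoding.ASCII` cannot represent characters outside the 7-bit range and substitutes `?` for them, so it is only lossless for pure-ASCII content (which is typical for CSS, but not guaranteed). A tiny sketch of that behaviour follows; the `AsciiRoundTrip` helper is made up for illustration.

```csharp
using System.Text;

public static class AsciiRoundTrip
{
    // Encodes text as ASCII and decodes it again, showing what would
    // survive a File.WriteAllText(..., Encoding.ASCII) call.
    public static string ThroughAscii(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        return Encoding.ASCII.GetString(ascii);
    }
}
```

For example, `AsciiRoundTrip.ThroughAscii(".header { color: red; }")` comes back unchanged, while `AsciiRoundTrip.ThroughAscii("caf\u00E9")` comes back as `caf?`.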

score 0 · Answer 5 · answered Mar 14 '18 at 13:37

For larger files, use the following code. It is memory efficient because it streams the file line by line instead of loading it all at once.

StreamReader sr = new StreamReader(path: @"<Input_file_full_path_with_byte_order_mark>", 
                    detectEncodingFromByteOrderMarks: true);

StreamWriter sw = new StreamWriter(path: @"<Output_file_without_byte_order_mark>", 
                    append: false, 
                    encoding: new UnicodeEncoding(bigEndian: false, byteOrderMark: false));

var lineNumber = 0;
while (!sr.EndOfStream)
{
    sw.WriteLine(sr.ReadLine());
    lineNumber += 1;
    if (lineNumber % 100000 == 0)
        Console.Write("\rLine# " + lineNumber.ToString("000000000000"));
}

sw.Flush();
sw.Close();
sr.Close(); // release the input file as well
