
I'm building a compression program. I want to use LZW for UTF-8 files (any UTF-8 files) and BZip2 for the rest (usually random binary files). I can't find a method to determine whether a file is UTF-8 or not.

I tried this and many other methods from all over Stack Overflow, but none of them work for me. I can share examples of files that should be recognized as UTF-8 and files that should be recognized as "others".

    else if (args[0] != null && args[1] != null)
    {
        if (/* random binary detected */) // <-- this is the check I can't figure out
        {
            Console.WriteLine("Started Bzip");
            byte[] res = new Bzip2Compressor(65).Compress(File.ReadAllBytes(args[0]));
            File.WriteAllBytes(args[1], res);
            Console.WriteLine("Done!");
            return;
        }
        else // for UTF-8 cases (both with BOM and without)
        {
            Console.WriteLine("Started LZW");
            byte[] res = LZWCompressor.Compress(File.ReadAllBytes(args[0]));
            File.WriteAllBytes(args[1], res);
            Console.WriteLine("Done");
            return;
        }
    }

Note: I only need to separate UTF-8 from all others.

EDIT: So I would like to check whether the first n symbols are invalid UTF-8:

// Fill a 1 MB buffer with random bytes and write it out as a test file.
var bytes = new byte[1024 * 1024];
new Random().NextBytes(bytes);
File.WriteAllBytes(@"PATH", bytes);

The general goal is to detect files created like in the code above as NOT UTF-8 files.
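A rough sketch of such a check (the helper name IsLikelyUtf8 and the 4 KB sample size are my own choices, not a standard API): read the first n bytes of the file and try to decode them with a strict UTF-8 decoder that throws on invalid input.

    using System;
    using System.IO;
    using System.Text;

    static class Utf8Probe
    {
        // Hypothetical helper: returns true if the first sampleSize bytes
        // of the file decode as valid UTF-8. The sample size is arbitrary.
        public static bool IsLikelyUtf8(string path, int sampleSize = 4096)
        {
            byte[] buffer = new byte[sampleSize];
            int read;
            using (var fs = File.OpenRead(path))
            {
                read = fs.Read(buffer, 0, buffer.Length);
            }

            // UTF8Encoding(encoderShouldEmitUTF8Identifier: false,
            //              throwOnInvalidBytes: true) makes GetString throw
            // DecoderFallbackException on any invalid UTF-8 byte sequence.
            var strictUtf8 = new UTF8Encoding(false, true);
            try
            {
                strictUtf8.GetString(buffer, 0, read);
                return true;
            }
            catch (DecoderFallbackException)
            {
                return false;
            }
        }
    }

One caveat: a valid multi-byte sequence cut off at the sample boundary would also throw, so a real implementation might tolerate an error within the last three bytes of the sample. For random bytes like those generated above, an invalid sequence appears within the first few dozen bytes with near certainty, so a small sample is enough.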

Jubick
  • In the absence of a BOM, to detect whether a file is entirely UTF-8 text, you will have to read it completely and see whether it has some octet/byte sequences that are not valid for UTF-8. While I have not done it myself, creating a custom encoding instance (using UTF-8, of course) with a DecoderExceptionFallback (https://learn.microsoft.com/en-us/dotnet/api/system.text.decoderexceptionfallback?view=netframework-4.7.1) should produce an exception if an invalid byte sequence is encountered when reading the UTF-8 text file with this custom UTF-8-based encoding (see the sketch after these comments). –  Dec 27 '18 at 14:52
  • (It will depend on your application scenario what you accept as a UTF-8 text file if it doesn't have a BOM. Limiting your use case to specific file extensions might be possible. Or perhaps just checking the first couple hundred or thousand characters of the file might be sufficient for you, and you could perhaps accept the risk of data following those first couple hundred/thousand characters not being UTF-8 text - the risk most likely being limited to only a less-than-perfect compression ratio.) –  Dec 27 '18 at 14:55
  • By the way, even if a file starts with a UTF-8 BOM, check whether you can read the first few hundred or thousand characters successfully (as per my first comment). This should protect you from (perhaps rare) occasions where some binary file blob just by accident starts with the same bytes as a UTF-8 BOM. –  Dec 27 '18 at 15:07
  • Revisiting your question, I wonder why only specifically UTF-8? What about text files using other common text encodings (such as Unicode/UTF-16, ISO-8859, CP-1252, etc.)? If your goal is to achieve the maximum possible compression ratio, why don't you let both compressors run/compete for each file and select the shorter result? You can pre-select a compression method for certain file extensions; for example, ".txt" files are (almost) always text, and ".exe" files are (almost) always binaries. If by a rare chance a ".txt" file is not text, who really cares if it is LZW compressed? –  Dec 27 '18 at 16:22
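  • To make the first comment concrete, here is a minimal sketch of that approach, assuming a hypothetical input file name "input.bin" and an arbitrary 4096-character limit. Encoding.GetEncoding can pair UTF-8 with a DecoderExceptionFallback, so reading the file through a StreamReader throws DecoderFallbackException on the first invalid byte sequence:

    using System;
    using System.IO;
    using System.Text;

    class Utf8StreamCheck
    {
        static void Main()
        {
            // UTF-8 with an exception fallback: decoding errors throw
            // instead of silently substituting the replacement character.
            Encoding strict = Encoding.GetEncoding(
                "utf-8",
                new EncoderExceptionFallback(),
                new DecoderExceptionFallback());

            bool looksLikeUtf8 = true;
            try
            {
                // detectEncodingFromByteOrderMarks: false keeps the reader
                // from silently switching encodings on a foreign BOM.
                using (var reader = new StreamReader("input.bin", strict, false))
                {
                    var buffer = new char[1024];
                    int total = 0;
                    // Check only the first few thousand characters, as
                    // suggested above, and accept the residual risk beyond.
                    while (total < 4096)
                    {
                        int n = reader.Read(buffer, 0, buffer.Length);
                        if (n == 0) break;
                        total += n;
                    }
                }
            }
            catch (DecoderFallbackException)
            {
                looksLikeUtf8 = false;
            }

            Console.WriteLine(looksLikeUtf8 ? "Started LZW" : "Started Bzip");
        }
    }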

0 Answers