I'm building a compression program. I want to use LZW for UTF-8 files (any UTF-8 files) and Bzip2 for everything else (usually random binary files). I can't find a way to determine whether a file is UTF-8 or not.
I tried this and many other methods from all over Stack Overflow, but none of them worked for me. I can share examples of files that should be recognized as UTF-8 and files that should be recognized as "others".
    else if (args[0] != null && args[1] != null)
    {
        if (random binary detected) // pseudocode: this is the check I'm looking for
        {
            Console.WriteLine("Started Bzip");
            byte[] res = new Bzip2Compressor(65).Compress(File.ReadAllBytes(args[0]));
            File.WriteAllBytes(args[1], res);
            Console.WriteLine("Done!");
            return;
        }
        else // for UTF-8 cases (both with BOM and without)
        {
            Console.WriteLine("Started LZW");
            byte[] res = LZWCompressor.Compress(File.ReadAllBytes(args[0]));
            File.WriteAllBytes(args[1], res);
            Console.WriteLine("Done");
            return;
        }
    }
Note: I only need to separate UTF-8 from everything else.
EDIT: So I would like to check whether the first n bytes are invalid UTF-8.
    var bytes = new byte[1024 * 1024];
    new Random().NextBytes(bytes);
    File.WriteAllBytes(@"PATH", bytes);
The general goal is to detect files created like in the code above as NOT UTF-8 files.
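To make the "check the first n bytes" idea concrete, here is a minimal sketch of what such a check might look like. It assumes strict decoding via `UTF8Encoding` with `throwOnInvalidBytes: true` is an acceptable definition of "is UTF-8"; the class name `Utf8Sniffer`, the method name `LooksLikeUtf8`, and the 64 KiB default sample size are hypothetical and not taken from any library.

```csharp
using System;
using System.Text;

static class Utf8Sniffer
{
    // Hypothetical helper: returns true when the first maxBytes bytes of data
    // decode as strict UTF-8 (with or without a BOM), false otherwise.
    public static bool LooksLikeUtf8(byte[] data, int maxBytes = 64 * 1024)
    {
        int count = Math.Min(data.Length, maxBytes);

        // throwOnInvalidBytes: true makes the decoder throw on malformed input
        // instead of silently substituting U+FFFD.
        Decoder decoder = new UTF8Encoding(false, true).GetDecoder();

        // Only flush when the whole buffer was examined; otherwise a multi-byte
        // character cut in half at the sample boundary would be reported as invalid.
        bool flush = count == data.Length;

        try
        {
            decoder.GetCharCount(data, 0, count, flush);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

With this sketch, the placeholder condition could be written as `if (!Utf8Sniffer.LooksLikeUtf8(File.ReadAllBytes(args[0])))`. Note that pure ASCII input counts as valid UTF-8 here, and that a buffer filled by `Random.NextBytes` is overwhelmingly likely to be rejected, because about half of its bytes are 0x80 or above and would each have to fall inside a well-formed multi-byte sequence.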