I have a set of Markdown files to pass to a Jekyll project, and I need to find their encoding (UTF-8 with BOM, UTF-8 without BOM, or ANSI) using a program or an API.

If I pass the location of the files, the files should be listed and read, and the encoding of each should be produced as the result.

Is there any code or API for this?

I have already tried sr.CurrentEncoding on a StreamReader, as mentioned in Effective way to find any file's Encoding, but the result differs from what Notepad++ reports.

I also tried https://github.com/errepi/ude (Mozilla Universal Charset Detector), as suggested in https://social.msdn.microsoft.com/Forums/vstudio/en-US/862e3342-cc88-478f-bca2-e2de6f60d2fb/detect-encoding-of-the-file?forum=csharpgeneral, by referencing ude.dll in the C# project, but the result does not match Notepad++: there the file encoding is shown as UTF-8, while the program reports UTF-8 with BOM.

I should get the same result both ways, so where is the problem?

Deepak Raj
  • 1
    This is not a duplicate of any other question, as I have tried the other answers for finding the encoding and they do not work properly for me. – Deepak Raj Jan 22 '18 at 11:34
  • Is there a reason you believe Notepad++ is correct and all the other solutions are incorrect? (In particular, why do you believe the file in question is ANSI and not UTF-8? What are the contents of the file?) This looks like a reverse engineering question to duplicate the specific algorithm used by Notepad++. Since it is a closed-source product, have you approached them for information about their product? – Rob Napier Jan 22 '18 at 13:52
  • "should get same result from both ways": probably not. Guessing programs choose their own algorithms. One thing that most have in common, though, is giving one answer when there are many possibilities. Perhaps that's what's confusing you. It is the author of any text file that chooses the encoding so you could just ask. – Tom Blodget Jan 22 '18 at 17:38
  • @RobNapier, "Since it is a closed-source product" - no [it is not](https://github.com/notepad-plus-plus/notepad-plus-plus). But as it is C++, it does look that way. – H H Jan 22 '18 at 20:37
  • Thanks @HenkHolterman. I misread their site! – Rob Napier Jan 22 '18 at 22:23
  • I have files in UTF-8 with BOM, UTF-8 without BOM, and ANSI. I need to convert the files to HTML using a Jekyll project, and before sending the files to the project I use Notepad++ to check the encoding. Only UTF-8 without BOM gets converted to an HTML file, so unless I can get the accurate encoding with a program, files with the wrong encoding might be sent to the Jekyll project. – Deepak Raj Jan 23 '18 at 06:39
  • For wrongly encoded files, the project will fail, so I have to verify the encoding programmatically before passing them to the Jekyll project and running it. – Deepak Raj Jan 23 '18 at 06:40
  • Ask a more focused question, this one is about detecting an Encoding. What would you do if you had that? – H H Jan 24 '18 at 11:01
  • @DeepakRaj Did you actually check the file in a hex editor? It's pretty easy to see if there's a BOM or not. – Nyerguds Feb 02 '18 at 23:54

3 Answers

15

Detecting encoding is always a tricky business, but detecting BOMs is dead simple. To get the BOM as byte array, just use the GetPreamble() function of the encoding objects. This should allow you to detect a whole range of encodings by preamble.
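As a minimal sketch of the preamble idea, the following compares the start of a byte array against the preambles of a few Unicode encodings. The class name BomSniffer and the method DetectBom are hypothetical names chosen for illustration; the candidates are ordered longest preamble first so that UTF-32 LE is not mistaken for UTF-16 LE.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
static class BomSniffer
{
    // Longest preambles first: UTF-32 LE (FF FE 00 00) starts with the
    // same two bytes as UTF-16 LE (FF FE), so it must be tested earlier.
    static readonly Encoding[] Candidates =
    {
        new UTF32Encoding(true, true),    // UTF-32 BE: 00 00 FE FF
        new UTF32Encoding(false, true),   // UTF-32 LE: FF FE 00 00
        new UTF8Encoding(true),           // UTF-8:     EF BB BF
        new UnicodeEncoding(true, true),  // UTF-16 BE: FE FF
        new UnicodeEncoding(false, true)  // UTF-16 LE: FF FE
    };

    // Returns the first candidate whose preamble matches the start of
    // the data, or null when no known BOM is present.
    public static Encoding DetectBom(byte[] bytes)
    {
        foreach (Encoding candidate in Candidates)
        {
            byte[] preamble = candidate.GetPreamble();
            if (bytes.Length >= preamble.Length
                && preamble.SequenceEqual(bytes.Take(preamble.Length)))
            {
                return candidate;
            }
        }
        return null;
    }
}
```

A null result only means that no BOM was found; it says nothing yet about whether the content is UTF-8 without BOM or a one-byte-per-character encoding.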

Now, as for detecting UTF-8 without preamble, actually that's not very hard either. See, UTF8 has strict bitwise rules about what values are expected in a valid sequence, and you can initialize a UTF8Encoding object in a way that will fail by throwing an exception when these sequences are incorrect.
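The strict check can be sketched in isolation like this. StrictUtf8 and IsValidUtf8 are hypothetical names; the only library behaviour relied on is that a UTF8Encoding constructed with throwOnInvalidBytes set to true throws a DecoderFallbackException (a subclass of ArgumentException) on malformed input.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of any library.
static class StrictUtf8
{
    // First argument: do not emit a BOM when encoding.
    // Second argument: throw on invalid bytes instead of substituting U+FFFD.
    static readonly UTF8Encoding Strict = new UTF8Encoding(false, true);

    // True when the whole byte array decodes as well-formed UTF-8.
    // Pure ASCII input also returns true, since ASCII is a subset of UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        try
        {
            Strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }
}
```

For example, the bytes C3 A9 (the UTF-8 form of "é") pass the check, while a lone E9 (the Windows-1252 form of "é") fails because E9 announces a multi-byte sequence that never follows.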

So if you first do the BOM check, and then the strict decoding check, and finally fall back to Win-1252 encoding (what you call "ANSI") then your detection is done.

// Requires: using System; using System.IO; using System.Linq; using System.Text;
Byte[] bytes = File.ReadAllBytes(filename);
Encoding encoding = null;
String text = null;
// Test UTF8 with BOM. This check can easily be copied and adapted
// to detect many other encodings that use BOMs.
UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
Boolean couldBeUtf8 = true;
Byte[] preamble = encUtf8Bom.GetPreamble();
Int32 prLen = preamble.Length;
if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
{
    // UTF8 BOM found; use encUtf8Bom to decode.
    try
    {
        // Seems that despite being an encoding with preamble,
        // it doesn't actually skip said preamble when decoding...
        text = encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
        encoding = encUtf8Bom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
        couldBeUtf8 = false;
    }
}
// use boolean to skip this if it's already confirmed as incorrect UTF-8 decoding.
if (couldBeUtf8 && encoding == null)
{
    // test UTF-8 on strict encoding rules. Note that on pure ASCII this will
    // succeed as well, since valid ASCII is automatically valid UTF-8.
    UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
    try
    {
        text = encUtf8NoBom.GetString(bytes);
        encoding = encUtf8NoBom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
    }
}
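// fall back to default ANSI encoding. On .NET Core / .NET 5+, code page 1252 is only
// available after Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).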
if (encoding == null)
{
    encoding = Encoding.GetEncoding(1252);
    text = encoding.GetString(bytes);
}

Note that Windows-1252 (US / Western European ANSI) is a one-byte-per-character encoding, meaning everything in it produces a technically valid character, so unless you go for heuristic methods, no further detection can be done on it to distinguish it from other one-byte-per-character encodings.
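To tie this back to the question (pass a folder, list the files, report each encoding), here is a rough sketch that applies the same three-step order to every Markdown file under a directory. The class name MarkdownEncodingReport, the Classify method, the "*.md" pattern, and the label strings are all assumptions made for illustration, and "ANSI" here simply means "not valid UTF-8".

```csharp
using System;
using System.IO;
using System.Text;

// Hypothetical console tool for illustration; names and labels are assumptions.
static class MarkdownEncodingReport
{
    // Classifies raw file content as one of the three cases from the question.
    public static string Classify(byte[] bytes)
    {
        bool hasBom = bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
        try
        {
            // Strict decoding: throws on any malformed UTF-8 sequence.
            // The BOM itself is valid UTF-8, so it can stay in the input.
            new UTF8Encoding(false, true).GetString(bytes);
            return hasBom ? "UTF-8 with BOM" : "UTF-8 without BOM";
        }
        catch (DecoderFallbackException)
        {
            // Not valid UTF-8; assumed to be a one-byte-per-character code page.
            return "ANSI (assumed Windows-1252)";
        }
    }

    public static void Main(string[] args)
    {
        // Folder to scan; defaults to the current directory.
        string folder = args.Length > 0 ? args[0] : ".";
        foreach (string path in Directory.EnumerateFiles(
                     folder, "*.md", SearchOption.AllDirectories))
        {
            Console.WriteLine(path + ": " + Classify(File.ReadAllBytes(path)));
        }
    }
}
```

Note that a pure ASCII file is reported as "UTF-8 without BOM" by this sketch, because its bytes are identical in both encodings.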

Nyerguds
  • 1
    You can of course add other encodings into these checks, but do be careful; some encodings have BOMs that start the same way as the BOMs of some others, so you have to test them in the right order. I know there's a question on SO somewhere with that list and logic, but I can't find it at the moment. – Nyerguds Jul 27 '18 at 11:26
5

Necromancing.

  • First, you check the Byte-Order Mark:
  • If that doesn't work, you can try to infer the encoding from the text-content with Mozilla Universal Charset Detector C# port.
  • If that doesn't work, you just return the CurrentCulture/InstalledUiCulture/System-Encoding - or whatever.
  • If the system encoding doesn't work, we can return either ASCII or UTF8. Since entries 0-127 of UTF8 are identical to ASCII, we simply return UTF8.

Example (DetectOrGuessEncoding):

namespace SQLMerge
{


    class EncodingDetector
    {


        public static System.Text.Encoding BomInfo(string srcFile)
        {
            return BomInfo(srcFile, false);
        } // End Function BomInfo 



        public static System.Text.Encoding BomInfo(string srcFile, bool thorough)
        {
            byte[] b = new byte[5];

            using (System.IO.FileStream file = new System.IO.FileStream(srcFile, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
            {
                int numRead = file.Read(b, 0, 5);
                if (numRead < 5)
                    System.Array.Resize(ref b, numRead);

                file.Close();
            } // End Using file 

            if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) // UTF32-BE 
                return System.Text.Encoding.GetEncoding("utf-32BE"); // UTF-32, big-endian 
            else if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) // UTF32-LE
                return System.Text.Encoding.UTF32; // UTF-32, little-endian
            // https://en.wikipedia.org/wiki/Byte_order_mark#cite_note-14    
            else if (b.Length >= 4 && b[0] == 0x2b && b[1] == 0x2f && b[2] == 0x76 && (b[3] == 0x38 || b[3] == 0x39 || b[3] == 0x2B || b[3] == 0x2F)) // UTF7
                return System.Text.Encoding.UTF7;  // UTF-7
            else if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) // UTF-8
                return System.Text.Encoding.UTF8;  // UTF-8
            else if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) // UTF16-BE
                return System.Text.Encoding.BigEndianUnicode; // UTF-16, big-endian
            else if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) // UTF16-LE
                return System.Text.Encoding.Unicode; // UTF-16, little-endian

            // Maybe there is a future encoding ...
            // PS: The above yields more than this - this doesn't find UTF7 ...
            if (thorough)
            {
                System.Collections.Generic.List<System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>> lsPreambles = 
                    new System.Collections.Generic.List<System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>>();

                foreach (System.Text.EncodingInfo ei in System.Text.Encoding.GetEncodings())
                {
                    System.Text.Encoding enc = ei.GetEncoding();

                    byte[] preamble = enc.GetPreamble();

                    if (preamble == null)
                        continue;

                    if (preamble.Length == 0)
                        continue;

                    if (preamble.Length > b.Length)
                        continue;

                    System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp =
                        new System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]>(enc, preamble);

                    lsPreambles.Add(kvp);
                } // Next ei

                // li.Sort((a, b) => a.CompareTo(b)); // ascending sort
                // li.Sort((a, b) => b.CompareTo(a)); // descending sort
                lsPreambles.Sort(
                    delegate (
                        System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp1, 
                        System.Collections.Generic.KeyValuePair<System.Text.Encoding, byte[]> kvp2)
                    {
                        return kvp2.Value.Length.CompareTo(kvp1.Value.Length);
                    }
                );


                for (int j = 0; j < lsPreambles.Count; ++j)
                {
                    for (int i = 0; i < lsPreambles[j].Value.Length; ++i)
                    {
                        if (b[i] != lsPreambles[j].Value[i])
                        {
                            goto NEXT_J_AND_NOT_NEXT_I;
                        }
                    } // Next i 

                    return lsPreambles[j].Key;
                    NEXT_J_AND_NOT_NEXT_I: continue;
                } // Next j 

            } // End if (thorough)

            return null;
        } // End Function BomInfo 


        public static System.Text.Encoding DetectOrGuessEncoding(string fileName)
        {
            return DetectOrGuessEncoding(fileName, false);
        }


        public static System.Text.Encoding DetectOrGuessEncoding(string fileName, bool withOutput)
        {
            if (!System.IO.File.Exists(fileName))
                return null;


            System.ConsoleColor origBack = System.ConsoleColor.Black;
            System.ConsoleColor origFore = System.ConsoleColor.White;
            

            if (withOutput)
            {
                origBack = System.Console.BackgroundColor;
                origFore = System.Console.ForegroundColor;
            }
            
            // System.Text.Encoding systemEncoding = System.Text.Encoding.Default; // Returns hard-coded UTF8 on .NET Core ... 
            System.Text.Encoding systemEncoding = GetSystemEncoding();
            System.Text.Encoding enc = BomInfo(fileName);
            if (enc != null)
            {
                if (withOutput)
                {
                    System.Console.BackgroundColor = System.ConsoleColor.Green;
                    System.Console.ForegroundColor = System.ConsoleColor.White;
                    System.Console.WriteLine(fileName);
                    System.Console.WriteLine(enc);
                    System.Console.BackgroundColor = origBack;
                    System.Console.ForegroundColor = origFore;
                }

                return enc;
            }

            using (System.IO.Stream strm = System.IO.File.OpenRead(fileName))
            {
                UtfUnknown.DetectionResult detect = UtfUnknown.CharsetDetector.DetectFromStream(strm);

                if (detect != null && detect.Details != null && detect.Details.Count > 0 && detect.Details[0].Confidence < 1)
                {
                    if (withOutput)
                    {
                        System.Console.BackgroundColor = System.ConsoleColor.Red;
                        System.Console.ForegroundColor = System.ConsoleColor.White;
                        System.Console.WriteLine(fileName);
                        System.Console.WriteLine(detect);
                        System.Console.BackgroundColor = origBack;
                        System.Console.ForegroundColor = origFore;
                    }

                    foreach (UtfUnknown.DetectionDetail detail in detect.Details)
                    {
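                        // Compare code pages rather than references: Encoding does not
                        // overload ==, and detail.Encoding may be null for unsupported charsets.
                        if (detail.Encoding != null
                            && (detail.Encoding.CodePage == systemEncoding.CodePage
                                || detail.Encoding.CodePage == System.Text.Encoding.UTF8.CodePage)
                        )
                            return detail.Encoding;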
                    }

                    return detect.Details[0].Encoding;
                }
                else if (detect != null && detect.Details != null && detect.Details.Count > 0)
                {
                    if (withOutput)
                    {
                        System.Console.BackgroundColor = System.ConsoleColor.Green;
                        System.Console.ForegroundColor = System.ConsoleColor.White;
                        System.Console.WriteLine(fileName);
                        System.Console.WriteLine(detect);
                        System.Console.BackgroundColor = origBack;
                        System.Console.ForegroundColor = origFore;
                    }

                    return detect.Details[0].Encoding;
                }

                enc = GetSystemEncoding();

                if (withOutput)
                {
                    System.Console.BackgroundColor = System.ConsoleColor.DarkRed;
                    System.Console.ForegroundColor = System.ConsoleColor.Yellow;
                    System.Console.WriteLine(fileName);
                    System.Console.Write("Assuming ");
                    System.Console.Write(enc.WebName);
                    System.Console.WriteLine("...");
                    System.Console.BackgroundColor = origBack;
                    System.Console.ForegroundColor = origFore;
                }

                return systemEncoding;
            } // End Using strm 

        } // End Function DetectOrGuessEncoding 


        public static System.Text.Encoding GetSystemEncoding()
        {
            // The OEM code page for use by legacy console applications
            // int oem = System.Globalization.CultureInfo.CurrentCulture.TextInfo.OEMCodePage;

            // The ANSI code page for use by legacy GUI applications
            // int ansi = System.Globalization.CultureInfo.InstalledUICulture.TextInfo.ANSICodePage; // Machine 
            int ansi = System.Globalization.CultureInfo.CurrentCulture.TextInfo.ANSICodePage; // User 

            try
            {
                // https://stackoverflow.com/questions/38476796/how-to-set-net-core-in-if-statement-for-compilation
                // NETCOREAPP is defined for all .NET Core targets and for .NET 5+.
#if ( NETSTANDARD && !NETSTANDARD1_0 )  || NETCORE || NETCOREAPP
                System.Text.Encoding.RegisterProvider(System.Text.CodePagesEncodingProvider.Instance);
#endif

                System.Text.Encoding enc = System.Text.Encoding.GetEncoding(ansi);
                return enc;
            }
            catch (System.Exception)
            { }


            try
            {

                foreach (System.Text.EncodingInfo ei in System.Text.Encoding.GetEncodings())
                {
                    System.Text.Encoding e = ei.GetEncoding();

                    // 20'127: US-ASCII 
                    if (e.WindowsCodePage == ansi && e.CodePage != 20127)
                    {
                        return e;
                    }

                }
            }
            catch (System.Exception)
            { }

            // return System.Text.Encoding.GetEncoding("iso-8859-1");
            return System.Text.Encoding.UTF8;
        } // End Function GetSystemEncoding 


    } // End Class 


}
Stefan Steiger
  • The UTF-7 one is technically incorrect anyway; it's four bytes, and the last 2 bits of the 4th belong to the next character, so it has to be checked with bit masks. And the thorough method should pre-sort the encodings by preamble lengths, longest first, or you'll match UTF-16's preamble on UTF-32. – Nyerguds May 20 '21 at 09:29
  • Great reference to the 'detector' package. It seems to be working. – James John McGuire 'Jahmic' Nov 17 '21 at 13:01
  • 2
    @Nyerguds: Damn, you're right on UTF-7. Fixed that, and added pre-sort. One really has to read those wikipedia-tables very very very carefully. – Stefan Steiger Nov 18 '21 at 09:19
  • Note, I don't think anything really uses UTF-7 anyway. Anyone saving text in UTF-7 is getting exactly what they deserve when nothing can open it. – Nyerguds Nov 19 '21 at 11:35
  • 1
    @Nyerguds: Quake3 and Java use it. Now, we can forgive&forget about Quake3, but Java ... For example, that file from the swiss postal services that I imported years ago, that included ZIP-codes and place names ... UTF-7 was my last guess, but the last guess proved right ;) In addition to that, it seems to have been a thing in e-mail, that was good for nothing but security problems. – Stefan Steiger Nov 23 '21 at 09:58
-1
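using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using System.Windows.Forms;

namespace WindowsFormsApp2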
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
        
        private void button1_Click(object sender, EventArgs e)
        {
            List<FilePath> filePaths = new List<FilePath>();
            filePaths = GetLstPaths();
        }
        public static List<FilePath> GetLstPaths()
        {
            #region Getting Files

            DirectoryInfo directoryInfo = new DirectoryInfo(@"C:\Users\Safi\Desktop\ss\");
            DirectoryInfo directoryTargetInfo = new DirectoryInfo(@"C:\Users\Safi\Desktop\ss1\");
            FileInfo[] fileInfos = directoryInfo.GetFiles("*.txt");
            List<FilePath> lstFiles = new List<FilePath>();
            foreach (FileInfo fileInfo in fileInfos)
            {
                Encoding enco = GetLittleIndianFiles(directoryInfo + fileInfo.Name);
                string filePath = directoryInfo + fileInfo.Name;
                string targetFilePath = directoryTargetInfo + fileInfo.Name;
                if (enco != null)
                {
                    FilePath f1 = new FilePath();
                    f1.filePath = filePath;
                    f1.targetFilePath = targetFilePath;
                    lstFiles.Add(f1);
                }
            }
            int count = 0;
            lstFiles.ForEach(d =>
            {
                count++;
            });
            MessageBox.Show(Convert.ToString(count) + " Files are Converted");
            #endregion
            return lstFiles;
        }
        public static Encoding GetLittleIndianFiles(string srcFile)
        {
            byte[] b = new byte[5];

            using (System.IO.FileStream file = new System.IO.FileStream(srcFile, System.IO.FileMode.Open, System.IO.FileAccess.Read, System.IO.FileShare.Read))
            {
                int numRead = file.Read(b, 0, 5);
                if (numRead < 5)
                    System.Array.Resize(ref b, numRead);

                file.Close();
            } // End Using file 
            if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE)
                return System.Text.Encoding.Unicode; // UTF-16, little-endian
            return null;
        }
    }

    public class FilePath
    {
        public string filePath { get; set; }
        public string targetFilePath { get; set; }
    }
}
Connell.O'Donnell