170

Yes, this is a frequent question, and the matter is vague for me since I don't know much about it.

But I would like a very precise way to find a file's encoding, as precise as Notepad++ is.

Cœur
  • 37,241
  • 25
  • 195
  • 267
Fábio Antunes
  • 16,984
  • 18
  • 75
  • 96
  • 1
    possible duplicate of [Java : How to determine the correct charset encoding of a stream](http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream) – Oded Sep 29 '10 at 20:06
  • Which encodings? UTF-8 vs UTF-16, big vs little endian? Or are you referring to the old MSDos codepages, such as shift-JIS or Cyrillic etc? – dthorpe Sep 29 '10 at 20:07
  • Another possible duplicate: http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file – Oded Sep 29 '10 at 20:09
  • @Oded: Quote "The getEncoding() method will return the encoding which was set up (read the JavaDoc) for the stream. It will not guess the encoding for you.". – Fábio Antunes Sep 29 '10 at 21:44
  • @dthorpe: Sorry I wasn't specific, I don't know much about encoding formats. Find any kind of encoding, basically. – Fábio Antunes Sep 29 '10 at 21:45
  • 2
    For some background reading, http://www.joelonsoftware.com/articles/Unicode.html is a good read. If there is one thing you should know about text, it's that there is no such thing as plain text. – Martijn Mar 24 '15 at 16:56
  • There is only one way to know for sure: find out from the sender/writer. – Tom Blodget Jun 15 '17 at 16:48

12 Answers

191

The StreamReader.CurrentEncoding property rarely returns the correct text file encoding for me. I've had greater success determining a file's encoding by analyzing its byte order mark (BOM). If the file does not have a BOM, this approach cannot determine the file's encoding.

*Updated 4/08/2020 to include UTF-32LE detection and to return the correct encoding for UTF-32BE.*

/// <summary>
/// Determines a text file's encoding by analyzing its byte order mark (BOM).
/// Defaults to ASCII when detection of the text file's endianness fails.
/// </summary>
/// <param name="filename">The text file to analyze.</param>
/// <returns>The detected encoding.</returns>
public static Encoding GetEncoding(string filename)
{
    // Read the BOM (Read may return fewer bytes than requested, so loop)
    var bom = new byte[4];
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        int read = 0;
        while (read < 4)
        {
            int n = file.Read(bom, read, 4 - read);
            if (n == 0) break; // reached end of file
            read += n;
        }
    }

    // Analyze the BOM
    if (bom[0] == 0x2b && bom[1] == 0x2f && bom[2] == 0x76) return Encoding.UTF7;
    if (bom[0] == 0xef && bom[1] == 0xbb && bom[2] == 0xbf) return Encoding.UTF8;
    if (bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0 && bom[3] == 0) return Encoding.UTF32; //UTF-32LE
    if (bom[0] == 0xff && bom[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
    if (bom[0] == 0xfe && bom[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
    if (bom[0] == 0 && bom[1] == 0 && bom[2] == 0xfe && bom[3] == 0xff) return new UTF32Encoding(true, true);  //UTF-32BE

    // We actually have no idea what the encoding is if we reach this point, so
    // you may wish to return null instead of defaulting to ASCII
    return Encoding.ASCII;
}
2Toad
  • 14,799
  • 7
  • 42
  • 42
  • 3
    +1. This worked for me too (whereas detectEncodingFromByteOrderMarks did not). I used "new FileStream(filename, FileMode.Open, FileAccess.Read)" to avoid an IOException when the file is read-only. – Polyfun Apr 07 '14 at 15:18
  • 73
    UTF-8 files can be without BOM, in this case it will return ASCII incorrectly. – user626528 Dec 22 '14 at 02:54
  • @2Toad I have tested it for UCS-2 Little Endian, but it returns ASCII only, which is a big issue for me as I have to reject UCS but accept ASCII. Help. – SurajS Mar 09 '15 at 11:18
  • 7
    This answer is wrong. Looking at the [reference source](https://github.com/Microsoft/referencesource/blob/master/mscorlib/system/io/streamreader.cs) for `StreamReader`, that implementation is what more people will want. They make new encodings rather than using the existing `Encoding.Unicode` objects, so equality checks will fail (which might rarely happen anyway because, for instance, `Encoding.UTF8` can return different objects), but it (1) doesn't use the really weird UTF-7 format, (2) defaults to UTF-8 if no BOM is found, and (3) can be overridden to use a different default encoding. – hangar Dec 17 '15 at 18:08
  • 7
    I had better success with new StreamReader(filename, true).CurrentEncoding – Benoit Mar 10 '16 at 08:22
  • Strangely enough, if you actually make a `UTF7Encoding` object and perform `GetPreamble()` on it, you get an empty array... – Nyerguds Mar 15 '16 at 16:35
  • By the way, this is missing little-endian UTF32, which must be tested before little-endian UTF-16 since it starts with the same two bytes. – Nyerguds Mar 15 '16 at 16:38
  • Seems UTF7 is a gigantic mess; its preamble actually _includes two bits of the first character_. It can't be detected _or_ decoded correctly with a simple preamble check like that. – Nyerguds Mar 17 '16 at 08:13
  • @blacai to detect UTF-8 without BOM you may want to scan the file like this: http://stackoverflow.com/a/4459679/1703648 – serop May 20 '16 at 15:44
  • @serop Thanks for the link. The problem I face would require to mix both solutions I suppose. I need to recognize whenever a file is UTF-8 and in case it is, it should not carry BOM... – blfuentes May 23 '16 at 07:00
  • 4
    There is a fundamental error in the code; when you detect the ***big-endian*** UTF32 *signature* (`00 00 FE FF`), you return the system-provided `Encoding.UTF32`, which is a ***little-endian*** encoding (as noted [here](https://learn.microsoft.com/en-us/dotnet/api/system.text.encoding.utf32?view=netframework-4.7.1)). And also, as noted by @Nyerguds, you still are not looking for UTF32LE, which has signature `FF FE 00 00` (according to https://en.wikipedia.org/wiki/Byte_order_mark). As that user noted, because it is subsuming, that check must come before the 2-byte checks. – Glenn Slayden Feb 08 '18 at 02:11
  • `Encoding.UTF32` is [little endian](https://msdn.microsoft.com/en-us/library/system.text.encoding.utf32%28v=vs.110%29.aspx?f=255&MSPPError=-2147217396), therefore should the `if` should be `bom[0] == 0xff && bom[1] == 0xfe && bom[2] == 0` – Anthony Apr 16 '18 at 00:36
  • 1
    Works only if BOM present! – Martin.Martinsson Sep 13 '18 at 17:39
  • Thanks. I don't know if the BOM character list is ok and as complete as it can be, and if endianness is correct. But I find it much cleaner and understandable than other answers that deal with test characters and exception management. I have just removed the return Encoding.ASCII; that I don't like at all, I prefer to return null and deal with files without BOM outside of method. – AFract Sep 25 '18 at 15:38
  • Disagree with falling back to ASCII. The safer fallback is UTF-8. Running ASCII (even if it uses extended ASCII characters) through a UTF8 decoder almost always produces the right result because a realistic sequence of extended ASCII characters that would be interpreted as a valid UTF-8 character is exceedingly rare. See https://en.wikipedia.org/wiki/UTF-8. – Emperor Eto Aug 07 '20 at 15:18
  • Is there a BOM for ISO-8859-1 and ISO-8859-2? – linuxman Aug 25 '21 at 08:13
  • `file.Read(bom, 0, 4)` does not guarantee the first 4 bytes are read. By contract, it returns a minimum of 1 and a maximum of 4 (the given number of) bytes and the filled in number of bytes are returned. This is a rather low level method that needs to be used in a loop, best make an extension method that does this for you. Also make sure that the method does not throw an exception if the file's size is less than 4 bytes. – Christoph Nov 02 '22 at 10:49
74

The following code works fine for me, using the StreamReader class:

  using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, true))
  {
      reader.Peek(); // you need this!
      var encoding = reader.CurrentEncoding;
  }

The trick is to use the Peek call; otherwise, .NET has done nothing (it hasn't read the preamble, the BOM). Of course, any other ReadXXX call made before checking the encoding works too.

If the file has no BOM, then the defaultEncodingIfNoBom encoding will be used. There is also a StreamReader constructor overload without this argument (in this case, the encoding will by default be set to UTF8 before any read), but I recommend defining what you consider the default encoding in your context.

I have tested this successfully with files with BOM for UTF8, UTF16/Unicode (LE & BE) and UTF32 (LE & BE). It does not work for UTF7.
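
For convenience, here is the same trick wrapped in a small helper (a sketch only; the method name is mine, and the fallback encoding is whatever fits your context):

public static Encoding GetEncodingFromBom(string fileName, Encoding defaultEncodingIfNoBom)
{
    using (var reader = new StreamReader(fileName, defaultEncodingIfNoBom, detectEncodingFromByteOrderMarks: true))
    {
        reader.Peek(); // force the reader to consume the preamble (BOM), if any
        return reader.CurrentEncoding;
    }
}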

Simon Mourier
  • 132,049
  • 21
  • 248
  • 298
  • 1
    I get back whatever I set as the default encoding. Could I be missing something? – Ram Mar 15 '16 at 09:28
  • 1
    @DRAM - this can happen if the file has no BOM – Simon Mourier Mar 15 '16 at 14:31
  • Thanks @Simon Mourier. I didn't expect my PDF / any file would not have a BOM. This link http://stackoverflow.com/questions/4520184/how-to-detect-the-character-encoding-of-a-text-file might be helpful for someone who tries to detect without a BOM. – Ram Mar 16 '16 at 09:42
  • 2
    In powershell I had to run $reader.close(), or else it was locked from writing. `foreach($filename in $args) { $reader = [System.IO.StreamReader]::new($filename, [System.Text.Encoding]::default,$true); $peek = $reader.Peek(); $reader.currentencoding | select bodyname,encodingname; $reader.close() }` – js2010 Apr 10 '19 at 21:53
  • 2
    @SimonMourier This does not work if encoding of the file is `UTF-8 without BOM` – Ozkan Apr 29 '19 at 08:09
  • 1
    @Ozkan - If there's no BOM, you can't guarantee 100% what a file encoding is. That's why I added a `defaultEncodingIfNoBom` parameter in my sample code. It's up to you to decide what that could be depending on your context. It's often UTF-8 these days. – Simon Mourier Apr 29 '19 at 08:21
  • @OndraStarenko - ANSI encoding can only be detected as a fallback, because it's "no encoding". .NET code only detects encoding with BOMs. – Simon Mourier Nov 12 '19 at 07:45
  • This doesn't work for me either, always returning UTF8, when I KNOW (because I can see it in different editors) that it's not that. – PHenry Jun 16 '21 at 20:30
  • @PHenry - A file w/o a BOM can be seen as basically anything (you will get back what you passed as defaultEncodingIfNoBom). The fact that *you know* is irrelevant, only the binary is. You can post your file somewhere so we can investigate more. – Simon Mourier Jun 17 '21 at 14:39
  • 1
    If you use the overload which does not take a default encoding, UTF-8 will be used (not ANSI): `new StreamReader(@"C:\Temp\File without BOM.txt", true).CurrentEncoding.EncodingName` returns `Unicode (UTF-8)` – Maxence Apr 27 '22 at 17:00
  • @Maxence - When I try that today it indeed returns UTF8 but I'm pretty sure I tried that back then and it was ANSI. I think it's related to Windows version, like Notepad which also used ANSI as default and now uses UTF8 (https://kimconnect.com/how-to-set-default-utf-8-encoding-for-new-notepad-documents-when-saving-file/). Anyway I updated my answer. – Simon Mourier Apr 27 '22 at 17:26
  • It's in the doc (https://learn.microsoft.com/en-us/dotnet/api/system.io.streamreader?view=net-6.0) : > StreamReader defaults to UTF-8 encoding unless specified otherwise, instead of defaulting to the ANSI code page for the current system. – Maxence Apr 29 '22 at 07:54
  • But in this answer (https://stackoverflow.com/a/13338497/200443) it is said that there was an error in a previous version of the documentation – Maxence Apr 29 '22 at 07:56
18

Providing the implementation details for the steps proposed by @CodesInChaos:

1) Check if there is a Byte Order Mark

2) Check if the file is valid UTF8

3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in codepages other than UTF-8 are not valid UTF-8. https://stackoverflow.com/a/4522251/867248 explains the tactic in more detail.

using System;
using System.IO;
using System.Text;

// Using encoding from BOM or UTF8 if no BOM found,
// check if the file is valid, by reading all lines
// If decoding fails, use the local "ANSI" codepage

public string DetectFileEncoding(Stream fileStream)
{
    var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());
    using (var reader = new StreamReader(fileStream, Utf8EncodingVerifier,
           detectEncodingFromByteOrderMarks: true, leaveOpen: true, bufferSize: 1024))
    {
        string detectedEncoding;
        try
        {
            while (!reader.EndOfStream)
            {
                var line = reader.ReadLine();
            }
            detectedEncoding = reader.CurrentEncoding.BodyName;
        }
        catch (Exception)
        {
            // Failed to decode the file using the BOM/UTF-8.
            // Assume it's the local ANSI codepage
            detectedEncoding = "ISO-8859-1";
        }
        // Rewind the stream
        fileStream.Seek(0, SeekOrigin.Begin);
        return detectedEncoding;
    }
}


[Test]
public void Test1()
{
    Stream fs = File.OpenRead(@".\TestData\TextFile_ansi.csv");
    var detectedEncoding = DetectFileEncoding(fs);

    using (var reader = new StreamReader(fs, Encoding.GetEncoding(detectedEncoding)))
    {
       // Consume your file
        var line = reader.ReadLine();
        ...
Berthier Lemieux
  • 3,785
  • 1
  • 25
  • 25
  • 1
    Thank you! This solved it for me. But I would prefer to use just `reader.Peek()` instead of `while (!reader.EndOfStream) { var line = reader.ReadLine(); }` – Harison Silva Feb 06 '19 at 11:46
  • `reader.Peek()` doesn't read the whole stream. I found that with larger streams, `Peek()` was inadequate. I used `reader.ReadToEndAsync()` instead. – Gary Pendlebury Apr 25 '19 at 14:54
  • And what is Utf8EncodingVerifier? – Emperor Eto Aug 07 '20 at 14:19
  • 2
    @PeterMoore It's an encoding for UTF-8, `var Utf8EncodingVerifier = Encoding.GetEncoding("utf-8", new EncoderExceptionFallback(), new DecoderExceptionFallback());` It is used in the `try` block when reading a line. If the encoder fails to parse the provided text (the text is not encoded with utf8), Utf8EncodingVerifier will throw. The exception is caught and we then know the text is not utf8, and default to ISO-8859-1 – Berthier Lemieux Aug 09 '20 at 16:56
  • I'm getting an exception with "Unable to translate bytes [E9] at index 313 from specified code page to Unicode." when using this code on an easy ANSI file. – PHenry Jun 16 '21 at 20:46
16

Check this.

UDE

This is a port of Mozilla Universal Charset Detector and you can use it like this...

using System;
using System.IO;

public static void Main(String[] args)
{
    string filename = args[0];
    using (FileStream fs = File.OpenRead(filename)) {
        Ude.CharsetDetector cdet = new Ude.CharsetDetector();
        cdet.Feed(fs);
        cdet.DataEnd();
        if (cdet.Charset != null) {
            Console.WriteLine("Charset: {0}, confidence: {1}", 
                 cdet.Charset, cdet.Confidence);
        } else {
            Console.WriteLine("Detection failed.");
        }
    }
}
  • 1
    You should know that UDE is GPL – lindexi Nov 27 '17 at 07:53
  • 2
    Ok if you are worried about the license then you can use this one. Licensed as MIT and you can use it for both open source and closed source software. https://www.nuget.org/packages/SimpleHelpers.FileEncoding/ – Alexei Agüero Alba Nov 28 '17 at 13:38
  • The license is MPL with a GPL option. `The library is subject to the Mozilla Public License Version 1.1 (the "License"). Alternatively, it may be used under the terms of either the GNU General Public License Version 2 or later (the "GPL"), or the GNU Lesser General Public License Version 2.1 or later (the "LGPL").` – jbtule Jun 10 '19 at 20:59
  • 1
    It appears this fork is currently the most active and has a nuget package UDE.Netstandard. https://github.com/yinyue200/ude – jbtule Jun 10 '19 at 21:09
  • very useful library, coped with a lot of different and unusual encodings! thanks! – mshakurov Jan 28 '20 at 10:04
  • 1
    That's nice but I refuse to believe a bloated library is needed for something so simple. – Emperor Eto Aug 07 '20 at 14:18
12

I'd try the following steps:

1) Check if there is a Byte Order Mark

2) Check if the file is valid UTF8

3) Use the local "ANSI" codepage (ANSI as Microsoft defines it)

Step 2 works because most non-ASCII sequences in codepages other than UTF-8 are not valid UTF-8.
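
A minimal sketch of step 2 (the helper name is mine; as noted in the comments below, UTF8Encoding can be told to throw on invalid bytes):

public static bool IsValidUtf8(byte[] bytes)
{
    // throwOnInvalidBytes makes invalid sequences throw instead of silently
    // becoming U+FFFD replacement characters
    var strictUtf8 = new UTF8Encoding(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetCharCount(bytes);
        return true;
    }
    catch (DecoderFallbackException)
    {
        return false;
    }
}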

CodesInChaos
  • 106,488
  • 23
  • 218
  • 262
  • This seems like the more correct answer, as the other answer does not work for me. One can do it with File.OpenRead and .Read-ing the first few bytes of the file. – user420667 Aug 12 '13 at 23:07
  • 2
    Step 2 is a whole bunch of programming work to check the bit patterns, though. – Nyerguds Mar 17 '16 at 08:15
  • @Nyerguds The lazy approach is trying to parse it as UTF-8 and restart from the beginning when you get a decoding error. A bit ugly (exceptions for control flow) and of course the parsing needs to be side-effect free. – CodesInChaos Mar 17 '16 at 08:17
  • 1
    I'm not sure decoding actually throws exceptions though, or if it just replaces the unrecognized sequences with '?'. I went with writing a bit pattern checking class, anyway. – Nyerguds Mar 17 '16 at 08:23
  • 4
    When you create an instance of `Utf8Encoding` you can pass in an extra parameter that determines if an exception should be thrown or if you prefer silent data corruption. – CodesInChaos Mar 17 '16 at 09:10
  • 1
    I like this answer. Most encodings (like 99% of your uses cases probably) will be either UTF-8 or ANSI (Windows codepage 1252). You can check if the string contains the replacement character (0xFFFD) to determine if the encoding failed. – marsze Jan 18 '17 at 09:05
  • You just list step 2 like it's simple. It's not. – Emperor Eto Aug 07 '20 at 14:16
  • @PeterMoore As long as you're able and willing to restart at the beginning of the file if it's not UTF-8, it's simple. – CodesInChaos Aug 07 '20 at 19:08
8

.NET is not very helpful, but you can try the following algorithm:

  1. try to find the encoding by BOM (byte order mark) ... very likely not to be found
  2. try parsing into different encodings

Here is the call:

var encoding = FileHelper.GetEncoding(filePath);
if (encoding == null)
    throw new Exception("The file encoding is not supported. Please choose one of the following encodings: UTF8/UTF7/iso-8859-1");

Here is the code:

public class FileHelper
{
    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM) and, if not found, by trying to parse it into different encodings.
    /// Defaults to UTF8 when detection of the text file's endianness fails.
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding or null.</returns>
    public static Encoding GetEncoding(string filename)
    {
        var encodingByBOM = GetEncodingByBOM(filename);
        if (encodingByBOM != null)
            return encodingByBOM;

        // BOM not found :(, so try to parse characters into several encodings
        var encodingByParsingUTF8 = GetEncodingByParsing(filename, Encoding.UTF8);
        if (encodingByParsingUTF8 != null)
            return encodingByParsingUTF8;

        var encodingByParsingLatin1 = GetEncodingByParsing(filename, Encoding.GetEncoding("iso-8859-1"));
        if (encodingByParsingLatin1 != null)
            return encodingByParsingLatin1;

        var encodingByParsingUTF7 = GetEncodingByParsing(filename, Encoding.UTF7);
        if (encodingByParsingUTF7 != null)
            return encodingByParsingUTF7;

        return null;   // no encoding found
    }

    /// <summary>
    /// Determines a text file's encoding by analyzing its byte order mark (BOM)  
    /// </summary>
    /// <param name="filename">The text file to analyze.</param>
    /// <returns>The detected encoding.</returns>
    private static Encoding GetEncodingByBOM(string filename)
    {
        // Read the BOM
        var byteOrderMark = new byte[4];
        using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
        {
            file.Read(byteOrderMark, 0, 4);
        }

        // Analyze the BOM
        if (byteOrderMark[0] == 0x2b && byteOrderMark[1] == 0x2f && byteOrderMark[2] == 0x76) return Encoding.UTF7;
        if (byteOrderMark[0] == 0xef && byteOrderMark[1] == 0xbb && byteOrderMark[2] == 0xbf) return Encoding.UTF8;
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe && byteOrderMark[2] == 0 && byteOrderMark[3] == 0) return Encoding.UTF32; //UTF-32LE (check before UTF-16LE, which shares the first two bytes)
        if (byteOrderMark[0] == 0xff && byteOrderMark[1] == 0xfe) return Encoding.Unicode; //UTF-16LE
        if (byteOrderMark[0] == 0xfe && byteOrderMark[1] == 0xff) return Encoding.BigEndianUnicode; //UTF-16BE
        if (byteOrderMark[0] == 0 && byteOrderMark[1] == 0 && byteOrderMark[2] == 0xfe && byteOrderMark[3] == 0xff) return new UTF32Encoding(bigEndian: true, byteOrderMark: true); //UTF-32BE (Encoding.UTF32 is little-endian)

        return null;    // no BOM found
    }

    private static Encoding GetEncodingByParsing(string filename, Encoding encoding)
    {            
        var encodingVerifier = Encoding.GetEncoding(encoding.BodyName, new EncoderExceptionFallback(), new DecoderExceptionFallback());

        try
        {
            using (var textReader = new StreamReader(filename, encodingVerifier, detectEncodingFromByteOrderMarks: true))
            {
                while (!textReader.EndOfStream)
                {                        
                    textReader.ReadLine();   // in order to increment the stream position
                }

                // all text parsed ok
                return textReader.CurrentEncoding;
            }
        }
        catch (Exception) { } // decoding failed with this encoding

        return null;
    }
}
Pacurar Stefan
  • 235
  • 4
  • 9
4

The solution proposed by @nonoandy is really interesting. I have successfully tested it and it seems to work perfectly.

The NuGet package needed is Microsoft.ProgramSynthesis.Detection (version 8.17.0 at the moment).

I suggest using EncodingTypeUtils.GetDotNetName instead of a switch for getting the Encoding instance:

using System.Text;
using Microsoft.ProgramSynthesis.Detection.Encoding;

...

public Encoding? DetectEncoding(Stream stream)
{
    try
    {
        if (stream.CanSeek)
        {
            // Read from the beginning if possible
            stream.Seek(0, SeekOrigin.Begin);
        }

        // Detect encoding type (enum)
        var encodingType = EncodingIdentifier.IdentifyEncoding(stream);
        
        // Get the corresponding encoding name to be passed to System.Text.Encoding.GetEncoding
        var encodingDotNetName = EncodingTypeUtils.GetDotNetName(encodingType);

        if (!string.IsNullOrEmpty(encodingDotNetName))
        {
            return Encoding.GetEncoding(encodingDotNetName);
        }
    }
    catch (Exception e)
    {
        // Handle exception (log, throw, etc...)
    }

    // In case of error return null or a default value
    return null;
}
devTrevi
  • 41
  • 1
3

Look here for C#:

https://msdn.microsoft.com/en-us/library/system.io.streamreader.currentencoding%28v=vs.110%29.aspx

string path = @"path\to\your\file.ext";

using (StreamReader sr = new StreamReader(path, true))
{
    while (sr.Peek() >= 0)
    {
        Console.Write((char)sr.Read());
    }

    //Test for the encoding after reading, or at least
    //after the first read.
    Console.WriteLine("The encoding used was {0}.", sr.CurrentEncoding);
    Console.ReadLine();
    Console.WriteLine();
}
SedJ601
  • 12,173
  • 3
  • 41
  • 59
1

The following is my PowerShell code to determine whether some .cpp, .h, or .ml files are encoded in ISO-8859-1 (Latin-1) or UTF-8 without BOM; if neither, it assumes GB18030. I am a Chinese developer working in France, and MSVC saves files as Latin-1 on a French computer and as GB on a Chinese computer, so this helps me avoid encoding problems when exchanging source files between my system and my colleagues'.

The approach is simple: if all characters are in the range \x00-\x7E, then ASCII, UTF-8, and Latin-1 are all the same; but if I read a non-ASCII file as UTF-8, the replacement character � shows up, so I try reading it as Latin-1. In Latin-1, the range between \x7F and \xAF is empty, while GB uses the full range \x00-\xFF, so if I find any byte between the two, it's not Latin-1.

The code is written in PowerShell, but it uses .NET, so it is easy to translate into C# or F# (see the sketch after the script).

$Utf8NoBomEncoding = New-Object System.Text.UTF8Encoding($False)
foreach($i in Get-ChildItem .\ -Recurse -include *.cpp,*.h, *.ml) {
    $openUTF = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::UTF8)
    $contentUTF = $openUTF.ReadToEnd()
    [regex]$regex = '�'
    $c=$regex.Matches($contentUTF).count
    $openUTF.Close()
    if ($c -ne 0) {
        $openLatin1 = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('ISO-8859-1'))
        $contentLatin1 = $openLatin1.ReadToEnd()
        $openLatin1.Close()
        [regex]$regex = '[\x7F-\xAF]'
        $c=$regex.Matches($contentLatin1).count
        if ($c -eq 0) {
            [System.IO.File]::WriteAllLines($i, $contentLatin1, $Utf8NoBomEncoding)
            $i.FullName
        } 
        else {
            $openGB = New-Object System.IO.StreamReader -ArgumentList ($i, [Text.Encoding]::GetEncoding('GB18030'))
            $contentGB = $openGB.ReadToEnd()
            $openGB.Close()
            [System.IO.File]::WriteAllLines($i, $contentGB, $Utf8NoBomEncoding)
            $i.FullName
        }
    }
}
Write-Host -NoNewLine 'Press any key to continue...';
$null = $Host.UI.RawUI.ReadKey('NoEcho,IncludeKeyDown');
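
For reference, the detection half translates to C# roughly like this (a sketch only: names are mine, it skips the script's rewriting step, and GB18030 may require registering the System.Text.Encoding.CodePages package on .NET Core):

using System.IO;
using System.Text;

static Encoding GuessEncoding(string path)
{
    byte[] bytes = File.ReadAllBytes(path);
    var strictUtf8 = new UTF8Encoding(false, throwOnInvalidBytes: true);
    try
    {
        strictUtf8.GetCharCount(bytes); // throws on invalid UTF-8 sequences
        return Encoding.UTF8;           // valid UTF-8 (or plain ASCII)
    }
    catch (DecoderFallbackException)
    {
        // Not UTF-8: any byte in the range the script treats as unused in
        // Latin-1 points to GB18030 instead
        foreach (byte b in bytes)
            if (b >= 0x7F && b <= 0xAF)
                return Encoding.GetEncoding("GB18030");
        return Encoding.GetEncoding("ISO-8859-1");
    }
}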
Enzojz
  • 863
  • 2
  • 9
  • 15
1

This seems to work well.

First create a helper method:

private static Encoding TestCodePage(Encoding testCode, byte[] byteArray)
{
    try
    {
        var encoding = Encoding.GetEncoding(testCode.CodePage, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        encoding.GetCharCount(byteArray); // throws if the bytes are invalid for this encoding
        return testCode;
    }
    catch (Exception)
    {
        return null;
    }
}

Then create code to test the source. In this case, I've got a byte array I need to get the encoding of:

public static Encoding DetectCodePage(byte[] contents)
{
    if (contents == null || contents.Length == 0)
    {
        return Encoding.Default;
    }

    return TestCodePage(Encoding.UTF8, contents)
           ?? TestCodePage(Encoding.Unicode, contents)
           ?? TestCodePage(Encoding.BigEndianUnicode, contents)
           ?? TestCodePage(Encoding.GetEncoding(1252), contents) // Western European
           ?? TestCodePage(Encoding.GetEncoding(28591), contents) // ISO Western European
           ?? TestCodePage(Encoding.ASCII, contents)
           ?? TestCodePage(Encoding.Default, contents); // likely Unicode
}
Glen Little
  • 6,951
  • 4
  • 46
  • 68
0

I have tried a few different ways to detect encoding and hit issues with most of them.

I made the following leveraging a Microsoft NuGet package, and it seems to work for me so far, but it needs a lot more testing.
Most of my testing has been on UTF-8, UTF-8 with BOM, and ANSI.

static void Main(string[] args)
{
    var path = Directory.GetCurrentDirectory() + "\\TextFile2.txt";
    List<string> contents = File.ReadLines(path, GetEncoding(path)).Where(w => !string.IsNullOrWhiteSpace(w)).ToList();

    int i = 0;
    foreach (var line in contents)
    {
        i++;
        Console.WriteLine(line);
        if (i > 100)
            break;
    }

}


public static Encoding GetEncoding(string filename)
{
    using (var file = new FileStream(filename, FileMode.Open, FileAccess.Read))
    {
        var detectedEncoding = Microsoft.ProgramSynthesis.Detection.Encoding.EncodingIdentifier.IdentifyEncoding(file);
        switch (detectedEncoding)
        {
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf8:
                return Encoding.UTF8;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Be:
                return Encoding.BigEndianUnicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf16Le:
                return Encoding.Unicode;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Utf32Le:
                return Encoding.UTF32;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Ascii:
                return Encoding.ASCII;
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Iso88591:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Unknown:
            case Microsoft.ProgramSynthesis.Detection.Encoding.EncodingType.Windows1252:
            default:
            return Encoding.Default;
        }
    }
}
nonoandy
  • 120
  • 15
-2

It may be useful

string path = @"address/to/the/file.extension";

using (StreamReader sr = new StreamReader(path))
{
    sr.Peek(); // read the preamble so CurrentEncoding reflects any detected BOM
    Console.WriteLine(sr.CurrentEncoding);
}
deHaar
  • 17,687
  • 10
  • 38
  • 51