Is there any way to discover what charset encoding a file is using?
7 Answers
The only way to reliably do this is to look for a byte order mark (BOM) at the start of the text file. (This mark identifies the Unicode encoding used - e.g. UTF-8, UTF-16, UTF-32 - and, for the multi-byte variants, the endianness.) Unfortunately, this method only works for Unicode-based encodings, and nothing before that (for which much less reliable methods must be used).
The StreamReader type supports detecting these marks to determine the encoding - you simply need to pass true for the detectEncodingFromByteOrderMarks parameter, as such:

new System.IO.StreamReader("path", true)

You can then check the value of streamReader.CurrentEncoding to determine the encoding used by the file. Note, however, that if no byte order mark exists, CurrentEncoding reports the reader's fallback default (UTF-8 for this constructor, not Encoding.Default), and that it is only meaningful after the first read from the stream.
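For illustration, a minimal sketch of that usage ("path.txt" is a placeholder file name). One gotcha worth knowing: CurrentEncoding only reflects the detected encoding after the first read, hence the Peek() call:

using System;
using System.IO;
using System.Text;

class BomDetectionExample
{
    static void Main()
    {
        // Passing true for detectEncodingFromByteOrderMarks makes the
        // reader inspect the first bytes of the file for a BOM.
        using (var reader = new StreamReader("path.txt", true))
        {
            reader.Peek(); // forces the reader to examine the first bytes
            Encoding encoding = reader.CurrentEncoding;
            Console.WriteLine(encoding.EncodingName);
        }
    }
}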

- Will this work for all possible encodings? For pre-Unicode ones as well? – Valentin V Aug 28 '09 at 13:56
- And if, like in most legacy encodings, there are no byte order marks, then you're out in the rain... – Artelius Aug 28 '09 at 13:57
- @Valentin: I'm afraid this will only differentiate between Unicode encodings. Generally it's assumed to be ANSI otherwise. – Noldorin Aug 28 '09 at 13:58
- Note that "ANSI" here has a very specific meaning unrelated to American National Standards, namely "the default codepage on this Windows installation". Could be CP1252, could be something else. – MSalters Aug 28 '09 at 14:50
See this: Detecting File Encodings in .NET
From MSDN:
There is no great way to detect an arbitrary ANSI code page, though there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats like XML or HTML have a way of specifying the character set on the first line in the file, so Web browsers, databases, and classes like XmlTextReader can read these files correctly. But many text files don't have this type of information built in.
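As an illustration of that last point, here's a naive sketch (a hypothetical helper, not from the article) that scans the first line of a file for an XML encoding="..." or HTML charset=... declaration. Real consumers like XmlTextReader or a browser do this far more robustly - for instance, they also handle UTF-16 documents, which this ASCII-based scan would not:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class DeclaredCharsetSniffer
{
    // Returns the charset declared on the first line, or null if none.
    // Assumes the declaration itself is ASCII-compatible (true for UTF-8
    // and single-byte code pages, false for UTF-16/UTF-32 documents).
    public static string SniffDeclaredCharset(string path)
    {
        string firstLine;
        using (var reader = new StreamReader(path, Encoding.ASCII))
            firstLine = reader.ReadLine() ?? "";

        Match m = Regex.Match(firstLine,
            @"(?:encoding|charset)=[""']?([A-Za-z0-9_-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}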
There's no way to do this with 100% reliability. You have to decide which cost vs accuracy tradeoffs you are comfortable with. I discuss many possible algorithms (with pros & cons) in this reply: PowerShell search script that ignores binary files

As Richard indicated, there's no completely reliable way to do this. However, here are some potentially helpful links:
http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=469
http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2

I coded this a while ago in C++, and it got pretty complex. Here's what I do (accepting the first check that matches):
- Look for Byte Order Marks
- Check if the text is valid UTF-32 BE/LE
- Check if the text is valid UTF-16 BE/LE
- Check if the text is valid UTF-8
- Assume current code page
This copes with the many BOM-less text files that are out there, but it does not help with text stored in custom ANSI code pages.
For those, no deterministic detection is possible: e.g. a file saved with an "Eastern European" encoding and loaded on a computer whose default code page is "Western European" will be garbled.
The only way to help in this case is to let the user select the code page (from a user-experience point of view, the best option is letting the user change the assumed encoding when they see garbled text).
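To make the garbling concrete, here's a small sketch (the sample word and code pages are arbitrary choices): the same bytes decode differently under Windows-1250 ("Eastern European") and Windows-1252 ("Western European"), which is also why keeping the raw bytes around makes the "let the user pick" remedy cheap to implement.

using System;
using System.Text;

class CodePageGarbleDemo
{
    static void Main()
    {
        // Czech "příliš" encoded with the Eastern European ANSI code page.
        byte[] raw = Encoding.GetEncoding(1250).GetBytes("příliš");

        // Decoded with the Western European default, it comes out garbled:
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(raw)); // "pøíliš"

        // Since the raw bytes are unchanged, letting the user pick the
        // right code page recovers the original text:
        Console.WriteLine(Encoding.GetEncoding(1250).GetString(raw)); // "příliš"
    }
}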
It works OK on a test set but of course misinterpretations are possible, if unlikely.
Code pages could be determined by statistical analysis of the text (e.g. the frequency of character pairs and triplets containing non-ASCII characters, or word lists in different languages), but I haven't found a suitable approach along those lines.
The Win32 IsTextUnicode is notoriously bad: it checks only for UTF-16, and is probably the culprit behind the "bush hid the facts" bug in Notepad.
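A rough C# translation of the cascade above (the original is C++; this is only a sketch, using .NET's strict decoders that throw on invalid input, with the BOM check reduced to the common signatures):

using System;
using System.Text;

static class EncodingSniffer
{
    // Mirrors the steps above: BOM first, then strict validation of
    // UTF-32/UTF-16/UTF-8, then the current ANSI code page as a fallback.
    // UTF-16 validation is permissive (almost any even-length byte sequence
    // decodes), so misinterpretations remain possible, as noted above.
    public static Encoding Detect(byte[] data)
    {
        Encoding bom = DetectBom(data);
        if (bom != null)
            return bom;

        var candidates = new Encoding[]
        {
            new UTF32Encoding(true, false, true),    // UTF-32 BE, throw on invalid
            new UTF32Encoding(false, false, true),   // UTF-32 LE
            new UnicodeEncoding(true, false, true),  // UTF-16 BE
            new UnicodeEncoding(false, false, true), // UTF-16 LE
            new UTF8Encoding(false, true),           // UTF-8
        };
        foreach (Encoding candidate in candidates)
        {
            try { candidate.GetString(data); return candidate; }
            catch (DecoderFallbackException) { /* not valid in this encoding */ }
        }
        return Encoding.Default; // current ANSI code page
    }

    static Encoding DetectBom(byte[] d)
    {
        if (d.Length >= 4 && d[0] == 0xFF && d[1] == 0xFE && d[2] == 0x00 && d[3] == 0x00)
            return new UTF32Encoding(false, true);  // UTF-32 LE
        if (d.Length >= 4 && d[0] == 0x00 && d[1] == 0x00 && d[2] == 0xFE && d[3] == 0xFF)
            return new UTF32Encoding(true, true);   // UTF-32 BE
        if (d.Length >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF)
            return Encoding.UTF8;
        if (d.Length >= 2 && d[0] == 0xFF && d[1] == 0xFE)
            return Encoding.Unicode;                // UTF-16 LE
        if (d.Length >= 2 && d[0] == 0xFE && d[1] == 0xFF)
            return Encoding.BigEndianUnicode;       // UTF-16 BE
        return null;
    }
}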

As peterchen wrote, type "bush hid the facts" into Notepad.exe, save the file, and reopen it to see how difficult it is to detect the encoding.

To add to the list of potentially useful links, here's a pretty small class I put together to detect Unicode encodings (with or without a BOM) vs. a default codepage (usually Windows-1252, often mislabelled "ASCII" / Encoding.ASCII in .NET):
http://www.architectshack.com/TextFileEncodingDetector.ashx
It goes a few steps further than the default StreamReader functionality, and is essentially what @peterchen describes in his answer above, except in C#:
- First check for a BOM, use it if provided
- Otherwise, check what Unicode encodings the file COULD be.
- For each possible Unicode encoding found, check whether that encoding is LIKELY for the provided data (assuming primarily Western European content)
- If the "possible" Unicode encodings don't look likely, use the default codepage/encoding provided
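The "is this encoding LIKELY" step can be approximated with a very simple statistic. A hedged sketch (not the linked class's actual code): BOM-less UTF-16 holding mostly Latin-script text has a null high byte in most code units, so counting 0x00 bytes at even vs. odd offsets is a usable tell.

using System;

static class Utf16Heuristic
{
    // Guesses whether BOM-less data is UTF-16 by counting null bytes at
    // even vs. odd offsets; Latin-script UTF-16 text produces a null in
    // one of the two positions for nearly every character.
    public static bool LooksLikeUtf16(byte[] data, out bool bigEndian)
    {
        int evenNulls = 0, oddNulls = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0x00)
            {
                if (i % 2 == 0) evenNulls++;
                else oddNulls++;
            }
        }
        bigEndian = evenNulls > oddNulls; // BE puts the high (null) byte first
        // Arbitrary threshold: call it UTF-16 when most character slots
        // in one column are null. Tune against real data.
        return Math.Max(evenNulls, oddNulls) > (data.Length / 2.0) * 0.6;
    }
}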
Sorry this answer's so late - I only recently cleaned up the class and put it online.
