Is there any way to discover what charset encoding a file is using?
7 Answers
The only way to reliably do this is to look for a byte order mark (BOM) at the start of the text file. (This mark identifies the Unicode encoding used - e.g. UTF-8, UTF-16, UTF-32 - and, for the multi-byte variants, the endianness.) Unfortunately, this method only works for Unicode-based encodings, and nothing before that (for which much less reliable methods must be used).
The StreamReader type supports detecting these marks to determine the encoding - you simply need to pass true for the detectEncodingFromByteOrderMarks parameter, as such:

new System.IO.StreamReader("path", true)

You can then check the value of streamReader.CurrentEncoding to determine the encoding used by the file. Note, however, that if no byte order mark exists, CurrentEncoding reports the reader's fallback default (UTF-8 for this constructor, not Encoding.Default), and that it is only meaningful after the first read from the stream.
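For illustration, a minimal sketch of that usage ("path.txt" is a placeholder file name). One gotcha worth knowing: CurrentEncoding only reflects the detected encoding after the first read, hence the Peek() call:

using System;
using System.IO;
using System.Text;

class BomDetectionExample
{
    static void Main()
    {
        // Passing true for detectEncodingFromByteOrderMarks makes the
        // reader inspect the first bytes of the file for a BOM.
        using (var reader = new StreamReader("path.txt", true))
        {
            reader.Peek(); // forces the reader to examine the first bytes
            Encoding encoding = reader.CurrentEncoding;
            Console.WriteLine(encoding.EncodingName);
        }
    }
}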

- Will this work for all possible encodings? For pre-Unicode ones as well? – Valentin V Aug 28 '09 at 13:56
- And if, like in most legacy encodings, there are no byte order marks, then you're out in the rain... – Artelius Aug 28 '09 at 13:57
- @Valentin: I'm afraid this will only differentiate between Unicode encodings. Generally it's assumed to be ANSI otherwise. – Noldorin Aug 28 '09 at 13:58
- Note that "ANSI" here has a very specific meaning unrelated to American National Standards, namely "the default codepage on this Windows installation". Could be CP1252, could be something else. – MSalters Aug 28 '09 at 14:50
See this: Detecting File Encodings in .NET
From MSDN:
There is no great way to detect an arbitrary ANSI code page, though there have been some attempts to do this based on the probability of certain byte sequences in the middle of text. We don't try that in StreamReader. A few file formats like XML or HTML have a way of specifying the character set on the first line in the file, so Web browsers, databases, and classes like XmlTextReader can read these files correctly. But many text files don't have this type of information built in.
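As an illustration of that last point, here's a naive sketch (a hypothetical helper, not from the article) that scans the first line of a file for an XML encoding="..." or HTML charset=... declaration. Real consumers like XmlTextReader or a browser do this far more robustly - for instance, they also handle UTF-16 documents, which this ASCII-based scan would not:

using System;
using System.IO;
using System.Text;
using System.Text.RegularExpressions;

static class DeclaredCharsetSniffer
{
    // Returns the charset declared on the first line, or null if none.
    // Assumes the declaration itself is ASCII-compatible (true for UTF-8
    // and single-byte code pages, false for UTF-16/UTF-32 documents).
    public static string SniffDeclaredCharset(string path)
    {
        string firstLine;
        using (var reader = new StreamReader(path, Encoding.ASCII))
            firstLine = reader.ReadLine() ?? "";

        Match m = Regex.Match(firstLine,
            @"(?:encoding|charset)=[""']?([A-Za-z0-9_-]+)",
            RegexOptions.IgnoreCase);
        return m.Success ? m.Groups[1].Value : null;
    }
}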
There's no way to do this with 100% reliability. You have to decide which cost vs accuracy tradeoffs you are comfortable with. I discuss many possible algorithms (with pros & cons) in this reply: PowerShell search script that ignores binary files

As Richard indicated, there's no completely reliable way to do this. However, here are some potentially helpful links:
http://www.codeproject.com/KB/recipes/DetectEncoding.aspx
http://www.devhood.com/tutorials/tutorial_details.aspx?tutorial_id=469
http://msdn.microsoft.com/en-us/netframework/aa569610.aspx#Question2

I coded this a while ago in C++, and it got pretty complex. Here's what I do (accepting the first check that matches):
- Look for Byte Order Marks
- Check if the text is valid UTF-32 BE/LE
- Check if the text is valid UTF-16 BE/LE
- Check if the text is valid UTF-8
- Assume current code page
This copes with the many BOM-less text files that are out there, but it does not help with text stored in custom ANSI code pages.
For those, no deterministic detection is possible: e.g. a file saved with an "Eastern European" encoding and loaded on a computer whose default code page is "Western European" will be garbled.
The only way to help in this case is to let the user select the code page (from a user-experience point of view, the best option is letting the user change the assumed encoding when they see garbled text).
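To make the garbling concrete, here's a small sketch (the sample word and code pages are arbitrary choices): the same bytes decode differently under Windows-1250 ("Eastern European") and Windows-1252 ("Western European"), which is also why keeping the raw bytes around makes the "let the user pick" remedy cheap to implement.

using System;
using System.Text;

class CodePageGarbleDemo
{
    static void Main()
    {
        // Czech "příliš" encoded with the Eastern European ANSI code page.
        byte[] raw = Encoding.GetEncoding(1250).GetBytes("příliš");

        // Decoded with the Western European default, it comes out garbled:
        Console.WriteLine(Encoding.GetEncoding(1252).GetString(raw)); // "pøíliš"

        // Since the raw bytes are unchanged, letting the user pick the
        // right code page recovers the original text:
        Console.WriteLine(Encoding.GetEncoding(1250).GetString(raw)); // "příliš"
    }
}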
It works OK on a test set but of course misinterpretations are possible, if unlikely.
Code pages could be determined by statistical analysis of the text (e.g. the frequency of character pairs and triplets containing non-ASCII characters, or word lists in different languages), but I haven't found a suitable approach along those lines.
The Win32 IsTextUnicode is notoriously bad: it checks only for UTF-16, and is probably the culprit behind the "bush hid the facts" bug in Notepad.
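A rough C# translation of the cascade above (the original is C++; this is only a sketch, using .NET's strict decoders that throw on invalid input, with the BOM check reduced to the common signatures):

using System;
using System.Text;

static class EncodingSniffer
{
    // Mirrors the steps above: BOM first, then strict validation of
    // UTF-32/UTF-16/UTF-8, then the current ANSI code page as a fallback.
    // UTF-16 validation is permissive (almost any even-length byte sequence
    // decodes), so misinterpretations remain possible, as noted above.
    public static Encoding Detect(byte[] data)
    {
        Encoding bom = DetectBom(data);
        if (bom != null)
            return bom;

        var candidates = new Encoding[]
        {
            new UTF32Encoding(true, false, true),    // UTF-32 BE, throw on invalid
            new UTF32Encoding(false, false, true),   // UTF-32 LE
            new UnicodeEncoding(true, false, true),  // UTF-16 BE
            new UnicodeEncoding(false, false, true), // UTF-16 LE
            new UTF8Encoding(false, true),           // UTF-8
        };
        foreach (Encoding candidate in candidates)
        {
            try { candidate.GetString(data); return candidate; }
            catch (DecoderFallbackException) { /* not valid in this encoding */ }
        }
        return Encoding.Default; // current ANSI code page
    }

    static Encoding DetectBom(byte[] d)
    {
        if (d.Length >= 4 && d[0] == 0xFF && d[1] == 0xFE && d[2] == 0x00 && d[3] == 0x00)
            return new UTF32Encoding(false, true);  // UTF-32 LE
        if (d.Length >= 4 && d[0] == 0x00 && d[1] == 0x00 && d[2] == 0xFE && d[3] == 0xFF)
            return new UTF32Encoding(true, true);   // UTF-32 BE
        if (d.Length >= 3 && d[0] == 0xEF && d[1] == 0xBB && d[2] == 0xBF)
            return Encoding.UTF8;
        if (d.Length >= 2 && d[0] == 0xFF && d[1] == 0xFE)
            return Encoding.Unicode;                // UTF-16 LE
        if (d.Length >= 2 && d[0] == 0xFE && d[1] == 0xFF)
            return Encoding.BigEndianUnicode;       // UTF-16 BE
        return null;
    }
}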

As peterchen wrote, type "bush hid the facts" into Notepad.exe, save the file, and reopen it to see how difficult it is to detect the encoding.

To add to the list of potentially useful links, here's a pretty small class I put together to detect Unicode encodings (with or without a BOM) vs. a default codepage (usually Windows-1252, often mislabelled "ASCII" / Encoding.ASCII in .NET):
http://www.architectshack.com/TextFileEncodingDetector.ashx
It goes a few steps further than the default StreamReader functionality, and is essentially what @peterchen describes in his answer above, except in C#:
- First check for a BOM, use it if provided
- Otherwise, check what Unicode encodings the file COULD be.
- For each possible Unicode encoding found, check whether that encoding is LIKELY for the provided data (assuming primarily Western European content)
- If the "possible" Unicode encodings don't look likely, use the default codepage/encoding provided
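The "is this encoding LIKELY" step can be approximated with a very simple statistic. A hedged sketch (not the linked class's actual code): BOM-less UTF-16 holding mostly Latin-script text has a null high byte in most code units, so counting 0x00 bytes at even vs. odd offsets is a usable tell.

using System;

static class Utf16Heuristic
{
    // Guesses whether BOM-less data is UTF-16 by counting null bytes at
    // even vs. odd offsets; Latin-script UTF-16 text produces a null in
    // one of the two positions for nearly every character.
    public static bool LooksLikeUtf16(byte[] data, out bool bigEndian)
    {
        int evenNulls = 0, oddNulls = 0;
        for (int i = 0; i < data.Length; i++)
        {
            if (data[i] == 0x00)
            {
                if (i % 2 == 0) evenNulls++;
                else oddNulls++;
            }
        }
        bigEndian = evenNulls > oddNulls; // BE puts the high (null) byte first
        // Arbitrary threshold: call it UTF-16 when most character slots
        // in one column are null. Tune against real data.
        return Math.Max(evenNulls, oddNulls) > (data.Length / 2.0) * 0.6;
    }
}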
Sorry this answer's so late - I only recently cleaned up the class and put it online.
