How to detect txt file encoding?

Question

Possible Duplicate:
How can I detect the encoding/codepage of a text file

I have plenty of txt files in a directory. I have to find all the ones with UTF-8 encoding. How can I achieve that?

Not generally possible. (Unless they have a UTF-8 BOM, but then it's still heuristic.) — Mat, Oct 04 '11 at 12:44
http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file - there are some suggestions but generally it is impossible. — Miserable Variable, Oct 04 '11 at 12:48
I'm not sure that this is a duplicate. The other questions ask the impossible question "how to detect *which* encoding I have", while this one asks the far more sensible "how can I tell *whether* I have UTF-8". — Kerrek SB, Oct 04 '11 at 13:03
Understood. And a follow-up question about OpenFileDialog: I created a txt file with Unicode encoding. After that I opened that file in Notepad and selected Save As. Why do I see Encoding = "ANSI" in the SaveFileDialog? I use Windows 7. — user970742, Oct 04 '11 at 13:49

score 2 · Answer 1 · answered Oct 04 '11 at 12:55

You cannot detect an arbitrary text encoding in full generality, since you can never know what a random bunch of bytes was intended to mean. The only meaningful question you can ask is "can I interpret this data correctly as UTF-8".

The easiest way to answer that is to run any of your favourite encoding converters on the file and check for errors (e.g. iconv() or something from ICU, or whatever C# provides). If you want to be manual, you would have to go through the file byte-by-byte and check if everything forms a correct UTF-8 code sequence. The validation is pretty much the same amount of work as flat-out conversion (to UTF-32), since for proper validation you'll not only have to check that all bytes make up complete code sequences, but also that the encoded value is itself a valid Unicode codepoint.
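As an illustration of the "run a converter and check for errors" approach, here is a minimal sketch in Python rather than C#. The function names `is_valid_utf8` and `find_utf8_files` and the `*.txt` pattern are assumptions made for this example, not something from the question. It relies on Python's built-in strict UTF-8 decoder, which raises `UnicodeDecodeError` on malformed input.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors."""
    try:
        # Strict decoding rejects truncated sequences, overlong forms,
        # surrogates and values above U+10FFFF.
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def find_utf8_files(directory: str, pattern: str = "*.txt") -> list[Path]:
    """List files in `directory` matching `pattern` whose bytes are valid UTF-8."""
    return sorted(
        path
        for path in Path(directory).glob(pattern)
        if path.is_file() and is_valid_utf8(path.read_bytes())
    )
```

Note that a pure ASCII file also passes this check, because every ASCII file is valid UTF-8. The check answers "can this be read as UTF-8", not "was this written as UTF-8".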

It's a fun little exercise to write this yourself, but the quickest solution would be to just use a library function.
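For anyone who wants to try the manual route, the byte-by-byte check could look roughly like the sketch below. It is written in Python for brevity, and the name `is_well_formed_utf8` is made up for this example. It checks the two things described above: that every sequence is complete, and that the decoded value is a valid code point (not overlong, not a surrogate, not above U+10FFFF).

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Validate UTF-8 by hand, one code sequence at a time."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # Single byte: U+0000..U+007F (plain ASCII).
            i += 1
            continue
        # Work out the sequence length, the smallest code point that
        # legitimately needs that length, and the payload bits of the lead.
        if 0xC2 <= lead <= 0xDF:
            length, min_cp, cp = 2, 0x80, lead & 0x1F
        elif 0xE0 <= lead <= 0xEF:
            length, min_cp, cp = 3, 0x800, lead & 0x0F
        elif 0xF0 <= lead <= 0xF4:
            length, min_cp, cp = 4, 0x10000, lead & 0x07
        else:
            # Stray continuation byte, or a lead byte that is never valid.
            return False
        if i + length > n:
            return False  # sequence is cut off at the end of the data
        for b in data[i + 1 : i + length]:
            if (b & 0xC0) != 0x80:
                return False  # not a continuation byte (10xxxxxx)
            cp = (cp << 6) | (b & 0x3F)
        # Reject overlong forms, surrogates and out-of-range values.
        if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            return False
        i += length
    return True
```

In practice a library decoder does the same job with less room for mistakes, so this is mostly useful for understanding what the library has to check.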

Sorry, but please correct me if I'm wrong. When you say "since you can never know what a random bunch of bytes was intended to mean", do you mean that a text file stores all of its text as bytes? With ANSI versus UTF-8 the difference can be clear, because the size of one character differs: 1 byte in ANSI and 2 bytes in UTF-8. But what about single-byte encodings? Does it mean that when I save the text "AAA" in ANSI encoding I save the bytes "95 95 95", and when I open this file in another single-byte encoding where position 95 is the character B, I will see "BBB"? — user970742, Nov 22 '11 at 05:30
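The scenario in this comment can be reproduced with a small sketch, with two corrections to the numbers. The letter "A" is byte 65 (0x41) in ASCII, in the Windows "ANSI" code pages and in UTF-8 alike, so "AAA" reads the same everywhere. UTF-8 also uses 1 to 4 bytes per character, not a fixed 2. The ambiguity shows up for bytes above 0x7F. The two code pages below, `cp1252` and `cp1251`, are picked only as examples of single-byte encodings.

```python
raw = b"\xc0\xc1\xc2"  # three bytes, all above the ASCII range

as_western = raw.decode("cp1252")   # Western European "ANSI" code page
as_cyrillic = raw.decode("cp1251")  # Cyrillic "ANSI" code page

print(as_western)   # ÀÁÂ
print(as_cyrillic)  # АБВ (Cyrillic letters, not Latin A, B)

# The same bytes are not valid UTF-8 at all, so a strict decode fails.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False

# ASCII text is identical in all three encodings.
ascii_bytes = "AAA".encode("ascii")
same_everywhere = (
    ascii_bytes.decode("cp1252")
    == ascii_bytes.decode("cp1251")
    == ascii_bytes.decode("utf-8")
)
```

Both single-byte decodes succeed, and nothing in the bytes says which one the author meant. That is why "which encoding is this" cannot be answered in general, while "is this valid UTF-8" can.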

score 1 · Answer 2 · answered Oct 04 '11 at 12:46

In a text file without any meta-data this may be impossible to tell.

skaz
