Possible Duplicate:
How can I detect the encoding/codepage of a text file

I have plenty of txt files in a directory. I need to find all the ones with UTF-8 encoding. How can I achieve that?

Community
user970742
  • Not generally possible. (Unless they have a UTF-8 BOM, but then it's still heuristic.) – Mat Oct 04 '11 at 12:44
  • http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file - there are some suggestions but generally it is impossible. – Miserable Variable Oct 04 '11 at 12:48
  • I'm not sure that this is a duplicate. The other questions ask the impossible question "how to detect *which* encoding I have", while this one asks the far more sensible "how can I tell *whether* I have UTF-8". – Kerrek SB Oct 04 '11 at 13:03
  • Understood. A follow-up question about OpenFileDialog: I created a txt file with Unicode encoding, then opened that file in Notepad and selected Save As. Why do I see Encoding = "ANSI" in the SaveFileDialog? I use Windows 7. – user970742 Oct 04 '11 at 13:49

2 Answers

You cannot detect an arbitrary text encoding in full generality, since you can never know what a random bunch of bytes was intended to mean. The only meaningful question you can ask is "can I interpret this data correctly as UTF-8".

The easiest way to answer that is to run your favourite encoding converter on the file and check for errors (e.g. iconv() or something from ICU, or whatever C# provides). If you want to do it manually, you have to go through the file byte by byte and check that everything forms a correct UTF-8 code sequence. Validation is pretty much the same amount of work as flat-out conversion (to UTF-32), since for proper validation you not only have to check that all bytes make up complete code sequences, but also that each encoded value is itself a valid Unicode code point (not an overlong encoding, not a surrogate, and not above U+10FFFF).
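As a rough illustration of the "let a converter report errors" approach, here is a minimal sketch in Python (chosen only for brevity; the question is about C#, where, as far as I know, a `UTF8Encoding` constructed with `throwOnInvalidBytes: true` gives the same strict behaviour). The helper names `is_valid_utf8` and `utf8_files` and the `*.txt` pattern are assumptions made for this example.

```python
from pathlib import Path


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        # Strict mode raises on the first malformed sequence.
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


def utf8_files(directory: str, pattern: str = "*.txt") -> list:
    """List files matching `pattern` whose contents are valid UTF-8."""
    return sorted(
        path
        for path in Path(directory).glob(pattern)
        if path.is_file() and is_valid_utf8(path.read_bytes())
    )
```

Note that a pure ASCII file passes this check too, since ASCII is a subset of UTF-8. The check only tells you that the file can be read as UTF-8, not that its author meant it that way.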

It's a fun little exercise to write this yourself, but the quickest solution would be to just use a library function.
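For anyone who wants to try the exercise, the manual check could look roughly like the sketch below. It is written in Python only to keep it short, the function name `validate_utf8` is made up for this example, and it should be read as an illustration of the byte-by-byte idea rather than a tuned implementation.

```python
def validate_utf8(data: bytes) -> bool:
    """Check byte by byte that `data` is well-formed UTF-8."""
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            # Single-byte sequence (plain ASCII).
            i += 1
            continue
        if 0xC2 <= lead <= 0xDF:
            # Two-byte sequence; leads 0xC0 and 0xC1 would be overlong.
            length, code, minimum = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, code, minimum = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, code, minimum = 4, lead & 0x07, 0x10000
        else:
            # Stray continuation byte or a lead byte that is never valid.
            return False
        if i + length > n:
            return False  # sequence is cut off at end of data
        for b in data[i + 1 : i + length]:
            if b & 0xC0 != 0x80:
                return False  # not a continuation byte (10xxxxxx)
            code = (code << 6) | (b & 0x3F)
        if code < minimum:
            return False  # overlong encoding
        if 0xD800 <= code <= 0xDFFF or code > 0x10FFFF:
            return False  # surrogate or beyond the Unicode range
        i += length
    return True
```

The decoded value `code` is computed along the way, which is why validation costs about as much as converting to UTF-32.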

Kerrek SB
  • Sorry, but please correct me if I'm wrong. By "since you can never know what a random bunch of bytes was intended to mean", do you mean that a text file stores all of its text as bytes? When we compare ANSI and UTF-8, the difference seems clear because the size of one char differs: 1 byte for ANSI and 2 bytes for UTF-8. But what about 1-byte char encodings? Does it mean that when I save the text "AAA" in ANSI encoding I save the bytes "95 95 95", and when I open this file in another 1-byte encoding where position 95 is the char B, I will see BBB? – user970742 Nov 22 '11 at 05:30

In a text file without any meta-data this may be impossible to tell.

skaz