How to detect wrong encoding declaration?

Question

I am building an ASP.NET web service that loads other web pages and hands them to clients. So far I have handled character encodings well: I read the charset from the HTML meta tag and use it to decode the file. However, some page authors don't understand character sets. They declare a specific encoding, e.g. "gb2312", when the page is actually plain UTF-8. When I decode such a page as gb2312, the text turns into a mess. How can I detect whether the text has been decoded properly? I loaded the page in IE, which correctly decodes it as UTF-8. How does it achieve that?

http://stackoverflow.com/questions/90838/how-can-i-detect-the-encoding-codepage-of-a-text-file — xxbbcc, May 19 '13 at 01:55

score 0 · Answer 1 · answered Jan 23 '14 at 09:06

If the file starts with a byte order mark (BOM), you can tell from it which encoding is used.

BOM and encoding
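A minimal sketch of the BOM check, written in Python for brevity (the same byte comparisons port directly to C#). The function name `sniff_bom` is hypothetical, not part of any library. Keep in mind that many web pages have no BOM at all, so this only settles the question when one is present.

```python
import codecs

# Longer BOMs come first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) when no BOM is present."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0
```

If a BOM is found, decode `data[bom_length:]` with the returned encoding and ignore the declared charset.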

If you want to detect the character set, you could use the C# port of Mozilla's character set detector.

CharDetSharp

If you want to be extra sure that you are using the correct one, you could look for character sequences that are not supposed to be there. A page is very unlikely to contain "Ã³kÃ©" on purpose. So you could look for such sequences and, if you find them, try a different encoding/character set to process your file.
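Here is a rough sketch of both ideas in Python (easy to port to C#). The names `decode_page` and `looks_like_mojibake` are made up for illustration. The first function assumes that bytes which decode as strict UTF-8 really are UTF-8, which is a heuristic rather than a guarantee. The second assumes the wrong decoding was Windows-1252, which is what produces sequences like "Ã³kÃ©".

```python
def decode_page(data, declared):
    """Return (text, encoding_used) for raw page bytes and a declared charset."""
    # Strict UTF-8 is a strong signal: text in a legacy multi-byte encoding
    # such as gb2312 rarely happens to be valid UTF-8 as well.
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        pass
    # Otherwise trust the declaration, if it names a known codec and fits.
    try:
        return data.decode(declared), declared
    except (UnicodeDecodeError, LookupError):
        # Last resort (an assumption of this sketch): lossy UTF-8.
        return data.decode("utf-8", errors="replace"), "utf-8-lossy"


def looks_like_mojibake(text):
    """True if text looks like UTF-8 bytes that were decoded as cp1252."""
    try:
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or not valid UTF-8 after the round trip.
        return False
    # Pure ASCII survives the round trip unchanged, so it is not flagged.
    return repaired != text
```

With this, a page declared as "gb2312" but stored as UTF-8 is decoded as UTF-8, while a genuine gb2312 page fails the strict UTF-8 check and falls through to its declared charset.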

Actually it is really hard to make your application completely "fool-proof".
