2

Half a year ago I ran into an annoying problem, and I still haven't been able to fix it. The problem lies in log4j logging, where the default charset is UTF-8.

Sometimes I receive messages in a different encoding, CP1252 (there's no way to change this). Logging them as UTF-8 makes the text unreadable. I can fix the encoding somehow, and that text would then be readable in the log.

But if I apply that "encoding fix" to a normal message, it gets messed up. I need to know whether the conversion is actually needed. Unfortunately, I have no ideas.

VirtualVoid
    It's not possible to reliably *detect* the encoding of a blob of text. You generally have to know what you're dealing with. Presumably you can determine the case where you are receiving messages in CP1252, no? What's the bigger scenario here? – deceze Mar 30 '12 at 01:31
  • Nope, I can't predict it :( As far as I remember, normal messages are UTF-8 and CP1251, but some of them, probably depending on the Windows language, come in CP1252. I can make them readable by converting 1252->1251->utf8, but that will surely mess up the normal ones. – VirtualVoid Mar 30 '12 at 01:44

2 Answers

3

As deceze commented, there is no reliable way to automatically detect the encoding of a text.

Most encodings use 1 byte per character, so the same sequence of bytes means a totally different string in different encodings. Pretty much the only thing you can reliably say is "this is not a valid UTF-8 string"; other frequently used encodings don't even have strict rules about which byte sequences are or are not valid.
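In Java, a minimal sketch of that one reliable check (the class name and sample bytes here are illustrative, not from the question's code) is to decode with a strict `CharsetDecoder` that reports malformed input instead of replacing it:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    /** Returns true if the bytes form a well-formed UTF-8 sequence. */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        // 0xE9 is 'é' in CP1252, but in UTF-8 it would start a
        // three-byte sequence, so 0xE9 followed by 0x6C is malformed.
        byte[] cp1252 = {(byte) 0x68, (byte) 0xE9, (byte) 0x6C};
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(cp1252)); // false
    }
}
```

Note that plain `new String(bytes, "UTF-8")` will not do this, because by default malformed bytes are silently replaced with U+FFFD rather than rejected.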

Your best option is to know the encoding of the message. The next best option is to preserve the text as a byte array alongside the "UTF-8 string".

If you only have to accept a very limited set of encodings (UTF-8/UTF-16/CP1252) you can try some heuristics - e.g. most English strings in UTF-16 will have 0 as every other byte, and you can then check whether the bytes are valid UTF-8 - if not, it is likely the remaining encoding.
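Those heuristics could be sketched like this (the class name, threshold, and ordering are my assumptions, not a definitive detector; pure ASCII input will pass the UTF-8 check, which is harmless since ASCII is a subset of both UTF-8 and CP1252):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuess {
    /** Rough guess among UTF-16, UTF-8 and CP1252, in that order. */
    public static String guess(byte[] bytes) {
        // Mostly-Latin UTF-16 text has a zero byte in nearly every
        // other position; a third of the bytes being zero is a crude cutoff.
        int zeros = 0;
        for (byte b : bytes) if (b == 0) zeros++;
        if (bytes.length > 0 && zeros * 3 >= bytes.length) return "UTF-16";
        // Next, accept UTF-8 only if the bytes decode strictly.
        if (isValidUtf8(bytes)) return "UTF-8";
        // Anything left is assumed to be the remaining encoding.
        return "windows-1252";
    }

    private static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(guess("Hello".getBytes(StandardCharsets.UTF_16))); // UTF-16
        System.out.println(guess("héllo".getBytes(StandardCharsets.UTF_8)));  // UTF-8
        System.out.println(guess(new byte[]{(byte) 0xE9, 0x6C}));             // windows-1252
    }
}
```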

Alexei Levenkov
  • Seems like checking the UTF-8 string for validity is not a bad idea. What is the correct way to do this? – VirtualVoid Mar 30 '12 at 01:49
  • If it is already a "String" by the time it gets to your code, it is likely too late; but if it is a byte array, converting it to a string using the UTF-8 encoding should do the check too (I don't know how to do it in Java, just assuming it is similar to C#). Also check out http://stackoverflow.com/questions/1677497/guessing-the-encoding-of-text-represented-as-byte-in-java which contains detailed steps and some library references. – Alexei Levenkov Mar 30 '12 at 01:57
  • Here is a technique to use the byte order mark in a file to determine its encoding (not guaranteed to work if the BOM is missing in a non-cp1252 encoded file) http://stackoverflow.com/questions/1835430/byte-order-mark-screws-up-file-reading-in-java. Otherwise, use ICU4J – ee. Mar 30 '12 at 03:23
1

Apache Tika includes an open source encoding detector.

There are also commercial alternatives.
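For reference, Tika's detector (the `CharsetDetector` class in the tika-parsers module, derived from ICU4J; the class name is from Tika 1.x and the sample text is my own) can be used roughly like this, assuming tika-parsers is on the classpath:

```java
// Requires the tika-parsers artifact on the classpath.
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class TikaDetect {
    public static void main(String[] args) {
        byte[] data = "Un café, s'il vous plaît"
                .getBytes(java.nio.charset.StandardCharsets.UTF_8);
        CharsetDetector detector = new CharsetDetector();
        detector.setText(data);
        // detect() returns the best match; detectAll() returns ranked candidates.
        CharsetMatch match = detector.detect();
        System.out.println(match.getName()
                + " (confidence " + match.getConfidence() + ")");
    }
}
```

Like any statistical detector, it returns a confidence score rather than a guarantee, and it is unreliable on very short inputs such as single log lines.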

bmargulies
  • I think you'd have to be pretty desperate to hook up an expensive (NLP-based) encoding detector to a messaging application's loggers. – Stephen C Mar 31 '12 at 02:26