Does anybody know of a simple way to detect character set encoding in Java? Some programs seem able to detect which character set a given piece of data uses, or at least make an approximation.

I suppose the underlying mechanism would have to decode the data in each candidate character set and pick whichever one yields the fewest undefined characters, using how common each character set is to break ties.

Any ideas?

benstpierre
  • What input are we talking about? A byte array (binary) or a char array (String)? And which charsets would you like to distinguish? It can only be done reliably for Unicode charsets (via byte order marks), not for others. – BalusC Feb 12 '10 at 00:06
  • This can be tricky. Over at this site pfarland is using some heuristics: http://forums.sun.com/thread.jspa?threadID=279203#3 – mre Feb 12 '10 at 00:10
  • Related topics: http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream and http://stackoverflow.com/questions/1888189/java-readers-and-encodings – BalusC Feb 12 '10 at 00:28

2 Answers

Take a look at jchardet, a library ported from the Mozilla browser that specializes in "guessing" the charset of a document.

Alternatively, the somewhat newer cpdetector library specializes in detecting the code page of a document.
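
If pulling in a library is not an option, the idea from the question (try each candidate charset and discard the ones that cannot decode the data) can be roughly approximated with the JDK alone. The following is only a sketch: the class and method names (`CharsetGuesser`, `plausibleCharsets`) are made up for illustration, and it filters candidates rather than ranking them. Note that single-byte charsets such as ISO-8859-1 accept any byte sequence, so this can rule charsets out but cannot confirm one.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of jchardet or cpdetector.
class CharsetGuesser {

    /** Returns the candidate charset names under which the data decodes without errors. */
    static List<String> plausibleCharsets(byte[] data, String... candidates) {
        List<String> plausible = new ArrayList<>();
        for (String name : candidates) {
            try {
                Charset.forName(name).newDecoder()
                        // REPORT makes the decoder throw instead of silently substituting U+FFFD
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(data));
                plausible.add(name);
            } catch (CharacterCodingException e) {
                // The bytes are not valid in this charset, so drop it from the list.
            }
        }
        return plausible;
    }
}
```

For example, `CharsetGuesser.plausibleCharsets(bytes, "UTF-8", "US-ASCII", "ISO-8859-1")` returns the subset of those three names that decode cleanly. Choosing among the survivors still needs a heuristic (or one of the libraries above).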

Sylar

To find out whether data is in a Unicode encoding (UTF-8, UTF-16, etc.), you can read the data as a byte stream and check the first 4 bytes (the maximum BOM size). The BOM, if present, is different for each encoding.

For example:

For UTF-8, the first 3 bytes will be EF, BB, BF.

For encodings other than the Unicode ones, I am not sure.
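
A minimal sketch of that BOM check, assuming the first few bytes of the data are already in a byte array. The class and method names (`BomSniffer`, `detectBom`) are invented for illustration. As the comments below point out, a BOM is optional, so a `null` result says nothing about the encoding, and a match is only a strong hint.

```java
// Hypothetical helper for illustration only.
class BomSniffer {

    /** Returns the encoding name implied by a leading BOM, or null if no known BOM is found. */
    static String detectBom(byte[] data) {
        // UTF-32 is checked first: the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            // Mask to compare as unsigned, since Java bytes are signed.
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }
}
```

One known ambiguity: UTF-16LE text whose first character is U+0000 also starts with FF FE 00 00, so this sketch would report it as UTF-32LE.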

sreejith
  • The optional UTF-8 BOM is only useful if it is present: http://en.wikipedia.org/wiki/Byte_order_mark – trashgod Feb 12 '10 at 03:03
  • @sreejith: the BOM solution above can only be used to tell that a file is not UTF-8 (in which case it won't start with the given BOM). But if the BOM is present, the file may or may not be UTF-8. For example, in some other file the initial bytes "EF, BB, BF" may actually be valid data. – Suraj Chandran Feb 18 '11 at 07:03