How to detect character set encoding in Java?

Question

Does anybody know of a simple way to detect the character set encoding of data in Java? It seems to me that some programs can detect which character set a given piece of data uses, or at least make an approximation.

I suppose the underlying mechanism would have to decode the data in each candidate character set and pick whichever one yields the fewest undefined characters, breaking ties in favor of the more common character set.
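
The mechanism guessed at above could be sketched roughly as follows. This is only an illustration, not how any particular detector actually works: the class name `CharsetGuesser` and its methods are hypothetical, and it assumes the candidate list is ordered from most to least common so that the first candidate wins a tie. It relies on the documented behavior of `new String(byte[], Charset)`, which replaces malformed or unmappable input with the charset's replacement string (U+FFFD for the standard charsets).

```java
import java.nio.charset.Charset;
import java.util.List;

// Hypothetical helper: scores each candidate charset by how many
// replacement characters appear after decoding the same bytes.
final class CharsetGuesser {
    private static final char REPLACEMENT = '\uFFFD';

    private CharsetGuesser() {}

    // Returns the candidate with the fewest replacement characters.
    // Assumption: candidates are ordered most common first, so the
    // strict "<" comparison keeps the earlier candidate on a tie.
    static Charset guess(byte[] data, List<Charset> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate charsets");
        }
        Charset best = null;
        int bestErrors = Integer.MAX_VALUE;
        for (Charset candidate : candidates) {
            // new String(...) never throws on bad input; it substitutes
            // the replacement character for undecodable byte sequences.
            int errors = countReplacements(new String(data, candidate));
            if (errors < bestErrors) {
                best = candidate;
                bestErrors = errors;
            }
        }
        return best;
    }

    // Counts U+FFFD characters. Text that legitimately contains U+FFFD
    // is also counted, which is one weakness of this heuristic.
    static int countReplacements(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == REPLACEMENT) {
                count++;
            }
        }
        return count;
    }
}
```

Note that single-byte charsets such as ISO-8859-1 map every byte to some character, so they always score zero errors. The approach can therefore rule out UTF-8 for a given input, but it cannot tell one single-byte charset from another.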

Any ideas?

What input are we talking about? A byte array (binary) or a char array (String)? Which encodings would you like to distinguish? It can be done only for Unicode charsets (with byte order marks), but not reliably for others. – BalusC, Feb 12 '10 at 00:06
This can be tricky. Over at this site pfarland is using some heuristics: http://forums.sun.com/thread.jspa?threadID=279203#3 – mre, Feb 12 '10 at 00:10
Related topics: http://stackoverflow.com/questions/499010/java-how-to-determine-the-correct-charset-encoding-of-a-stream and http://stackoverflow.com/questions/1888189/java-readers-and-encodings – BalusC, Feb 12 '10 at 00:28

score 1 · Answer 1 · answered Feb 12 '10 at 08:12

Take a look at jchardet, a library ported from the Mozilla browser that specializes in "guessing" the charset of a document.

As an alternative, the cpdetector library, a bit newer, specializes in detecting the code page of a document.

answered Feb 12 '10 at 08:12

Sylar

score -3 · Accepted Answer · answered Feb 12 '10 at 01:44

To find out whether data is in a Unicode encoding (UTF-8, UTF-16, etc.), you can read the data as a byte stream and check the first 4 bytes (the maximum BOM size). When a byte order mark is present, it is different for each encoding.

For example, for UTF-8 the first 3 bytes will be EF, BB, BF.

For encodings other than the Unicode encodings, I am not sure.
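
A minimal sketch of that BOM check, assuming the first few bytes of the data are already in a byte array. The class name `BomSniffer` and its methods are made up for illustration. The byte values are the standard BOMs for UTF-8, UTF-16 and UTF-32. An empty result only means no BOM was found, not that the data is not Unicode, since the BOM is optional.

```java
import java.util.Optional;

// Hypothetical helper: maps a leading byte order mark to a charset name.
final class BomSniffer {
    private BomSniffer() {}

    // Returns the charset name implied by a leading BOM, or empty if
    // the data does not start with one of the known BOMs.
    static Optional<String> detect(byte[] data) {
        // Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks:
        // the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE).
        // UTF-16LE text whose first character is U+0000 would be misread
        // as UTF-32LE here; that ambiguity is inherent to BOM sniffing.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return Optional.of("UTF-32BE");
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return Optional.of("UTF-32LE");
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return Optional.of("UTF-8");
        if (startsWith(data, 0xFE, 0xFF)) return Optional.of("UTF-16BE");
        if (startsWith(data, 0xFF, 0xFE)) return Optional.of("UTF-16LE");
        return Optional.empty();
    }

    // Compares the leading bytes as unsigned values against the prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would typically read up to 4 bytes from the start of the stream, pass them to `BomSniffer.detect`, and fall back to some other heuristic or a default charset when the result is empty.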

answered Feb 12 '10 at 01:44

sreejith

The optional UTF-8 BOM is only useful if it is present: http://en.wikipedia.org/wiki/Byte_order_mark – trashgod Feb 12 '10 at 03:03
@sreejith: the BOM solution above can only be used to tell that a file is not UTF-8 (in which case it won't start with the given BOM). But if the BOM is present, the file can be either UTF-8 or not. For example, in some other file the initial bytes "EF, BB, BF" may actually be valid data. – Suraj Chandran Feb 18 '11 at 07:03
