4

I was wondering if there are known methods to detect (or give a best guess of) the encoding of a particular string in Java.

I know that you always need some additional metadata to be sure what the encoding is, and that there are best practices for this, but in my situation I need to give the best approximation.

A solution, or a pointer, for programmatically distinguishing between UTF-8 and UTF-16 is also welcome.

SuPra

3 Answers

2

The UTF-8 encoding should be easy to verify:

"UTF-8 strings can be fairly reliably recognized as such by a simple heuristic algorithm." (from Wikipedia)

Take a look at this site to see the algorithm
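As a rough illustration of this kind of heuristic (not necessarily the algorithm on the linked page), here is a minimal sketch using only the JDK. It assumes the input is available as raw bytes rather than an already-decoded String. The class name `EncodingGuess` and the methods `isValidUtf8` and `guess` are made up for this example. The UTF-16 part relies on the assumption that the text is mostly Latin characters, so one byte of each 16-bit unit is zero.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of any library.
final class EncodingGuess {

    /** Returns true if the bytes decode as UTF-8 with no malformed sequence. */
    public static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Strict decode: throws instead of substituting U+FFFD.
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Best-guess label: "UTF-8", "UTF-16LE", "UTF-16BE" or "unknown".
     * Assumption: mostly Latin text, so UTF-16 data contains many zero
     * bytes at either the even offsets (BE) or the odd offsets (LE).
     */
    public static String guess(byte[] data) {
        int zerosEven = 0;
        int zerosOdd = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 0) {
                if ((i & 1) == 0) {
                    zerosEven++;
                } else {
                    zerosOdd++;
                }
            }
        }
        if (zerosEven + zerosOdd == 0) {
            // No NUL bytes at all: plausible UTF-8 if it decodes strictly.
            return isValidUtf8(data) ? "UTF-8" : "unknown";
        }
        if (data.length % 2 != 0) {
            return "unknown"; // UTF-16 always has an even byte count
        }
        if (zerosEven > zerosOdd) {
            return "UTF-16BE";
        }
        if (zerosOdd > zerosEven) {
            return "UTF-16LE";
        }
        return "unknown";
    }
}
```

Note that the strict UTF-8 check alone is not enough to rule out UTF-16: plain ASCII text encoded as UTF-16 is also a valid UTF-8 byte sequence (every other byte is NUL), which is why the sketch looks at zero bytes first. For non-Latin text the zero-byte assumption breaks down and the result is only a guess.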

woliveirajr
  • Accepting this as the right answer since he pointed me to an algorithm rather than an existing toolkit. I actually also like the method used here: http://www.opensourcejavaphp.net/java/groovy/groovy/util/CharsetToolkit.java.html – SuPra Jul 26 '11 at 20:11
2

Look at ICU4J, which includes a character set detector (the CharsetDetector class).

MJB
0

Take a look at Apache Commons IO, in particular BOMInputStream.
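To show what a BOM check looks at, here is a minimal plain-JDK sketch. It is not the Commons IO API: the class `BomSniffer` and its `detect` method are hypothetical names for this example. It only covers the UTF-8 and UTF-16 byte order marks and, as a simplifying assumption, ignores UTF-32 (whose little-endian BOM starts with the same two bytes as UTF-16LE).

```java
// Hypothetical helper for illustration; not part of Apache Commons IO.
final class BomSniffer {

    /**
     * Returns the charset name implied by a leading byte order mark,
     * or null if the data does not start with a recognized BOM.
     */
    public static String detect(byte[] data) {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF;
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "UTF-16BE";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "UTF-16LE"; // assumption: not a UTF-32LE BOM
            }
        }
        return null; // no BOM: fall back to a content-based guess
    }
}
```

When no BOM is present the method returns null, so this only helps for input that actually carries one.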

Sualeh Fatehi
  • Yeah, but I don't have the BOM or any other metadata to help me out here. That's why I asked for a best guess. – SuPra Jul 26 '11 at 20:10