22

I need to be able to take a string in Java and determine whether or not all of the characters contained within it are in a specified character set (e.g. ISO-8859-1). I've looked around quite a bit for a simple way to do this (including playing around with a CharsetDecoder), but have yet to be able to find something.

What is the best way to take a string and determine if all the characters are within a given character set?

Michael

2 Answers

32

The CharsetEncoder class in package java.nio.charset offers a canEncode method that tests whether a given character or character sequence can be encoded in that charset.

Michael basically did something like this:

Charset.forName( CharEncoding.ISO_8859_1 ).newEncoder().canEncode("string")

Note that CharEncoding.ISO_8859_1 relies on Apache Commons and may be replaced by "ISO_8859_1" (or the canonical name "ISO-8859-1").
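For anyone who wants to avoid the Apache Commons dependency, here is a minimal sketch of the same check using only the JDK. It assumes Java 7 or later for `StandardCharsets`; the class name `CharsetCheck` and the helper `isEncodable` are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class CharsetCheck {
    // Hypothetical helper: true if every character of text can be encoded in charset.
    static boolean isEncodable(String text, Charset charset) {
        // Encoders are stateful and not thread-safe, so create a fresh one per call.
        CharsetEncoder encoder = charset.newEncoder();
        return encoder.canEncode(text);
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is part of ISO-8859-1.
        System.out.println(isEncodable("caf\u00e9", StandardCharsets.ISO_8859_1)); // true
        // U+20AC (euro sign) is not part of ISO-8859-1.
        System.out.println(isEncodable("5 \u20ac", StandardCharsets.ISO_8859_1));  // false
    }
}
```

Creating a new encoder on every call keeps the sketch simple; if this runs in a hot loop, reusing one encoder per thread is probably worth considering.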

Aubin
  • Excellent! That seems to do exactly what I want and is extremely clean and simple. Now I feel silly for asking after spending all this time looking at the opposite class (`CharsetDecoder`). Thanks! – Michael Oct 30 '12 at 17:26
  • Just for reference for anyone: I basically did something like this: `Charset.forName(CharEncoding.ISO_8859_1).newEncoder().canEncode("string")` – Michael Oct 30 '12 at 17:32
  • I know this post is old, but it was the first result in my search. For those who want to determine if a string is encoded in one of the IBM EBCDIC character sets, like IBM-1047, use "Cp1047". For IBM-737, use "Cp737". Reference: Java 7 documentation: https://docs.oracle.com/javase/7/docs/technotes/guides/intl/encoding.doc.html – John Czukkermann Sep 17 '18 at 16:21
2

I think the easiest way is to build a table of the Unicode characters that can be represented in the target character set encoding and then test each character in the string against it. For the ISO-8859 family, the table can usually be represented by one or a few ranges of Unicode characters, making the test relatively easy. It's a lot of manual work, but it needs to be done only once.

EDIT: or use Aubin's answer if the charset is supported in your Java implementation. :)
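A rough sketch of the table approach, for illustration only. It assumes Java 8 or later (for `String.codePoints()`), and the names `RangeTableCheck`, `ISO_8859_1_RANGES` and `allInRanges` are invented here. ISO-8859-1 is the easy case because it covers exactly U+0000 through U+00FF; tables for other charsets would have to be filled in by hand.

```java
class RangeTableCheck {
    // Hypothetical table: each row is an inclusive {first, last} code point range.
    // ISO-8859-1 needs a single range, U+0000 through U+00FF.
    static final int[][] ISO_8859_1_RANGES = { {0x0000, 0x00FF} };

    // True if every code point of text falls inside at least one range.
    static boolean allInRanges(String text, int[][] ranges) {
        return text.codePoints().allMatch(cp -> inAnyRange(cp, ranges));
    }

    private static boolean inAnyRange(int codePoint, int[][] ranges) {
        for (int[] range : ranges) {
            if (codePoint >= range[0] && codePoint <= range[1]) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(allInRanges("caf\u00e9", ISO_8859_1_RANGES)); // true
        System.out.println(allInRanges("5 \u20ac", ISO_8859_1_RANGES));  // false
    }
}
```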

Ted Hopp
  • @Aubin - Cheers. Of course, your solution only works if the Java implementation supports the target [`Charset`](http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html). (No problem for ISO-8859-1 and the other standard charsets, but other ISO-8859 encodings are usually not supported.) – Ted Hopp Oct 30 '12 at 17:25
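Whether a particular charset is available can be checked at runtime before asking for an encoder. The following is only a sketch: `CharsetSupportCheck` and `supportsEncoding` are illustrative names, and the results for anything beyond the standard charsets depend on the JVM it runs on.

```java
import java.nio.charset.Charset;

class CharsetSupportCheck {
    // Hypothetical helper: true if this JVM knows the charset and can encode with it.
    // Charset.isSupported throws IllegalCharsetNameException for syntactically
    // illegal names, so callers should pass well-formed charset names.
    static boolean supportsEncoding(String charsetName) {
        return Charset.isSupported(charsetName)
                && Charset.forName(charsetName).canEncode();
    }

    public static void main(String[] args) {
        // Output for the last two names varies between Java implementations.
        for (String name : new String[] { "ISO-8859-1", "ISO-8859-15", "Cp1047" }) {
            System.out.println(name + " -> " + supportsEncoding(name));
        }
    }
}
```

The extra `canEncode()` call on the `Charset` is there because some charsets are decode-only and `newEncoder()` would throw `UnsupportedOperationException` for them.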