5

Hello, I am looking for a way to detect whether a string has been encoded.

For example

    String name = "Hellä world";
    String encoded = new String(name.getBytes("utf-8"), "iso8859-1");

The output of this encoded variable is:

HellÃ¤ world

As you can see, there is an A with a tilde (Ã) and another symbol (¤). Is there a way to check whether the output contains encoded characters like these?

Decrypter
  • 3
    All characters are encoded. Are you trying to tell if a character has been encoded as two bytes or more instead of one? – Peter Lawrey Jul 03 '12 at 10:38
  • If you're trying to check whether the string `name` can be correctly encoded in ISO-8859-1 then [this existing question](http://stackoverflow.com/q/13144250/441108) (linked from one of this question's links) looks like the answer. – Richard Barnett Mar 13 '14 at 23:44

6 Answers

14

Sounds like you want to check whether a string that was decoded from bytes as Latin-1 could have been decoded as UTF-8, too. That's easy, because when Java decodes bytes into a String, illegal byte sequences are replaced by the character \ufffd:

    String recoded = new String(encoded.getBytes("iso-8859-1"), "UTF-8");
    return recoded.indexOf('\uFFFD') == -1; // No replacement character found
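
The two lines above can be wrapped into a self-contained method. This is a minimal sketch: the class and method names (`MojibakeCheck`, `looksLikeMisdecodedUtf8`) are made up for illustration, and it uses `StandardCharsets` constants instead of charset names to avoid the checked `UnsupportedEncodingException`.

```java
import java.nio.charset.StandardCharsets;

class MojibakeCheck {

    // Returns true if the Latin-1 bytes of 'text' also form valid UTF-8,
    // i.e. no U+FFFD replacement character appears after re-decoding.
    static boolean looksLikeMisdecodedUtf8(String text) {
        byte[] raw = text.getBytes(StandardCharsets.ISO_8859_1);
        String recoded = new String(raw, StandardCharsets.UTF_8);
        return recoded.indexOf('\uFFFD') == -1;
    }

    public static void main(String[] args) {
        String name = "Hell\u00e4 world"; // "Hellä world"
        String encoded = new String(name.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeMisdecodedUtf8(encoded)); // true
        System.out.println(looksLikeMisdecodedUtf8(name));    // false
    }
}
```

Note that a pure ASCII string also returns true, since ASCII is valid UTF-8, so the check only says the text could have been UTF-8, not that it was.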
Joni
5

Your question doesn't make sense. A Java String is a sequence of characters. Characters don't have an encoding until you convert them into bytes, at which point you need to specify one (although you will see a lot of code that uses the platform default, which is what, for example, String.getBytes() with no argument does).

I suggest you read this http://kunststube.net/encoding/.

artbristol
  • 4
    This answer is absolutely correct, but may still be somewhat cryptic to newbies. The question, really, is "*How can I tell if a String has been encoded with a certain encoding?*" The short answer is: trial and error. You can set up a `CharsetDecoder` configured for a particular target encoding (UTF-8/ISO-8859-1, etc.), and try to run your String through that decoder. If the decoding fails or throws an exception, you know your String contains 1+ characters that aren't that target encoding. If the decoder decodes without error, then you know your String meets the criteria for that encoding. –  Aug 27 '13 at 12:44
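
The trial-and-error idea in the comment above can be sketched with a `CharsetDecoder` configured to report errors instead of replacing them. A decoder consumes bytes rather than a String, so this sketch assumes you still have the raw bytes; the names `StrictDecode` and `isValid` are illustrative only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class StrictDecode {

    // Returns true if 'bytes' decode cleanly in 'charset', false if the
    // decoder reports malformed or unmappable input.
    static boolean isValid(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Keep in mind that ISO-8859-1 assigns a character to every byte value, so any byte array is "valid" Latin-1; a strict check is only informative for encodings such as UTF-8 that reject some byte sequences.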
5
    String name = "Hellä world";
    String encoded = new String(name.getBytes("utf-8"), "iso8859-1");

This code is just a character corruption bug. You take a UTF-16 string, transcode it to UTF-8, pretend it is ISO-8859-1 and transcode it back to UTF-16, resulting in incorrectly encoded characters.
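
To make the steps of that corruption concrete, here is a small sketch that performs the same transcoding and then reverses it. The helper names `corrupt` and `repair` are invented for this example. The reversal works here because ISO-8859-1 maps every byte to a character, so no information is lost along the way.

```java
import java.nio.charset.StandardCharsets;

class RoundTrip {

    // Encode as UTF-8, then wrongly decode those bytes as ISO-8859-1.
    static String corrupt(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // Undo the mistake: recover the original bytes, then decode them as UTF-8.
    static String repair(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String name = "Hell\u00e4 world";      // "Hellä world"
        String broken = corrupt(name);          // "HellÃ¤ world"
        System.out.println(broken.length());    // 12: 'ä' became two characters
        System.out.println(repair(broken).equals(name)); // true
    }
}
```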

McDowell
5

If I understood your question correctly, this code may help you. The function isEncoded checks whether its parameter can be encoded as ASCII; it returns true if the text contains non-ASCII characters.

    // Returns true if text contains characters that US-ASCII cannot represent.
    public boolean isEncoded(String text) {
        Charset charset = Charset.forName("US-ASCII");
        // Unmappable characters become '?' when encoded, so the round trip alters the text.
        String checked = new String(text.getBytes(charset), charset);
        return !checked.equals(text);
    }

    @Test
    public void testAscii() throws Exception {
        Assert.assertFalse(isEncoded("Hello world"));
    }

    @Test
    public void testNonAscii() throws Exception {
        Assert.assertTrue(isEncoded("Hellä world"));
    }

You can also check against other charsets by changing the charset variable or by turning it into a parameter.
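
A possible parameterized variant is sketched below. It swaps the round-trip comparison for `CharsetEncoder.canEncode`, which answers the same question directly; the names `CharsetFit` and `fitsIn` are hypothetical. Note the inverted sense compared with isEncoded: it returns true when the text fits the charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetFit {

    // Returns true if every character of 'text' can be represented in 'charset'.
    static boolean fitsIn(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        System.out.println(fitsIn("Hello world", StandardCharsets.US_ASCII));        // true
        System.out.println(fitsIn("Hell\u00e4 world", StandardCharsets.US_ASCII));   // false
        System.out.println(fitsIn("Hell\u00e4 world", StandardCharsets.ISO_8859_1)); // true
    }
}
```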

Andrea Parodi
3

I'm not really sure what you are trying to do or what your problem is.

This line doesn't make any sense:

    String encoded = new String(name.getBytes("utf-8"), "iso8859-1");

You are encoding your name into "UTF-8" and then trying to decode as "iso8859-1".

If you want to encode your name as "iso8859-1", just do name.getBytes("iso8859-1").
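
As a small illustration of what encoding to bytes means here, the sketch below compares the byte lengths of the same String in both charsets. The class and method names are made up for the example.

```java
import java.nio.charset.StandardCharsets;

class EncodeLengths {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies when encoded as ISO-8859-1.
    static int latin1Length(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String name = "Hell\u00e4 world"; // 11 characters
        System.out.println(utf8Length(name));   // 12: 'ä' takes two bytes in UTF-8
        System.out.println(latin1Length(name)); // 11: 'ä' takes one byte in ISO-8859-1
    }
}
```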

Please tell us what is the problem you encountered so that we can help more.

bruno conde
0

You can check whether your string is encoded or not with this code:

    // Returns true if any character is in the Unicode "other letter" category.
    public boolean isEncoded(String input) {
        for (char c : input.toCharArray()) {
            if (Character.getType(c) == Character.OTHER_LETTER) {
                return true;
            }
        }
        return false;
    }
Pooya
  • 1
    I think you are only testing if the String contains a char in "other letter" unicode group. But Character.getType('ä') == Character.LOWERCASE_LETTER and Character.getType('a') == Character.LOWERCASE_LETTER – Andrea Parodi Jul 03 '12 at 11:29
  • Yes, because I think the question is how to find that a string contains encoded chars or not, and this method returns that – Pooya Jul 03 '12 at 14:46
  • But Character.getType('ä') == Character.LOWERCASE_LETTER and Character.getType('ä') != Character.OTHER_LETTER, so your code does not work. The Character.OTHER_LETTER does not contain all unicode chars, only a particular subgroup. – Andrea Parodi Jul 03 '12 at 14:59
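
The point made in these comments can be checked directly. The sketch below (class and method names are illustrative) looks up the general category of the characters involved; none of the characters in "Hellä world" or in its corrupted form "HellÃ¤ world" falls into OTHER_LETTER, whereas a character such as Hiragana あ (U+3042) does.

```java
class LetterTypes {

    // True if 'c' is in the Unicode general category "Lo" (other letter).
    static boolean isOtherLetter(char c) {
        return Character.getType(c) == Character.OTHER_LETTER;
    }

    public static void main(String[] args) {
        System.out.println(Character.getType('\u00e4') == Character.LOWERCASE_LETTER); // 'ä'
        System.out.println(Character.getType('\u00c3') == Character.UPPERCASE_LETTER); // 'Ã'
        System.out.println(Character.getType('\u00a4') == Character.CURRENCY_SYMBOL);  // '¤'
        System.out.println(isOtherLetter('\u3042'));                                   // 'あ'
    }
}
```

All four lines print true, which is consistent with the objection that a test for OTHER_LETTER does not detect this kind of corruption.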