How to validate if a UTF-8 string contains mal-encoded characters

Question

In a large data set I have some data that looks like this:

"guide (but, yeah, itâ€™s okay to share it with â€˜em)."

I've opened the file in a hex editor and run the raw byte data through a character encoding detection algorithm (http://code.google.com/p/juniversalchardet/) and it's positively detected as UTF-8.

It appears to me that the source of the data misinterpreted the original character set and wrote the result out as valid UTF-8, which is what I have received.

I'd like to validate the data as best I can. Are there any heuristics or algorithms out there that might help me take a stab at validation?

What is the source here? Did you push the original data to said source? At a first glance I'd say you tried and pushed cp-1252 apostrophes to it without them being converted to proper UTF-8 equivalents... — fge, Jan 09 '13 at 14:04
You need to show how you're reading the particular data from the data set and how you're presenting the particular data to the end user/yourself. For example, are you using `FileReader` to read it and `System.out.println()` to present it? You have to tell one or both of them to use UTF-8 instead of the platform default charset, which is recognizable as CP1252. — BalusC, Jan 09 '13 at 14:11
This looks like a UTF-8 data source (with U+2019 `’` encoded correctly as the octets `e2 80 99`) decoded using the single-byte windows-1252 encoding (where they are interpreted as the code points U+00e2 U+20ac U+2122 - `â€™`). — McDowell, Jan 09 '13 at 14:11
Possible duplicate of [Check if a String is valid UTF-8 encoded in Java](http://stackoverflow.com/questions/6622226/check-if-a-string-is-valid-utf-8-encoded-in-java) — james.garriss, Oct 02 '15 at 14:40
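
To make the mechanism described in McDowell's comment concrete, here is a minimal sketch that reproduces the garbling: it encodes U+2019 as UTF-8 and then decodes those bytes as windows-1252. The class and method names (`MojibakeDemo`, `misdecode`, `hex`) are made up for illustration and do not come from the thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {

    // Encode correctly as UTF-8, then (wrongly) decode the bytes as windows-1252.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    // Space-separated lowercase hex dump of a byte array.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "it\u2019s"; // "it’s", written with an escape to avoid source-encoding issues
        System.out.println(hex(original.getBytes(StandardCharsets.UTF_8))); // 69 74 e2 80 99 73
        // Yields U+00E2 U+20AC U+2122 in place of the apostrophe; how it
        // displays depends on the console's own encoding.
        System.out.println(misdecode(original));
    }
}
```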

Esailija · Accepted Answer · 2013-01-09T14:17:59.060

You cannot do that once you have the string; you have to do it while you still have the raw input. Once you have the string, there is no way to automatically tell whether â€™ was actually the intended input without some seriously fragile tests. For example:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;

public static boolean isUTF8MisInterpreted(String input) {
    // Convenience overload for the most common UTF-8 misinterpretation,
    // which is also the case in your question.
    return isUTF8MisInterpreted(input, "Windows-1252");
}

public static boolean isUTF8MisInterpreted(String input, String encoding) {
    CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
    CharsetEncoder encoder = Charset.forName(encoding).newEncoder();
    ByteBuffer tmp;

    // Undo the suspected wrong decoding: turn the chars back into bytes.
    try {
        tmp = encoder.encode(CharBuffer.wrap(input));
    } catch (CharacterCodingException e) {
        // The string contains chars the suspected encoding cannot produce.
        return false;
    }

    // If those bytes are valid UTF-8, the string may have been misinterpreted.
    try {
        decoder.decode(tmp);
        return true;
    } catch (CharacterCodingException e) {
        return false;
    }
}

public static void main(String[] args) {
    String test = "guide (but, yeah, itâ€™s okay to share it with â€˜em).";
    String test2 = "guide (but, yeah, it’s okay to share it with ‘em).";
    System.out.println(isUTF8MisInterpreted(test));  // true
    System.out.println(isUTF8MisInterpreted(test2)); // false
}
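
One reason a test like this is fragile: a pure ASCII string encodes and decodes cleanly in both charsets, so `isUTF8MisInterpreted` returns true for it even though nothing is wrong. The following is a rough sketch of a slightly stricter variant that ignores pure ASCII input and returns the repaired text instead of a boolean. `MojibakeRepair` and `tryRepair` are hypothetical names, and the non-ASCII check is an added assumption, not part of the code above. It is still only a heuristic, since a string can pass the round trip and yet be exactly what the author intended.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Optional;

final class MojibakeRepair {

    // Returns the repaired text if 'input' looks like UTF-8 bytes that were
    // decoded with 'wrongEncoding'; otherwise returns an empty Optional.
    static Optional<String> tryRepair(String input, String wrongEncoding) {
        // Pure ASCII survives the round trip unchanged, so it carries no evidence.
        boolean hasNonAscii = input.chars().anyMatch(c -> c > 0x7f);
        if (!hasNonAscii) {
            return Optional.empty();
        }
        try {
            // Undo the suspected wrong decoding to recover the original bytes.
            ByteBuffer bytes = Charset.forName(wrongEncoding)
                    .newEncoder()
                    .encode(CharBuffer.wrap(input));
            // Strictly decode those bytes as UTF-8; malformed input throws.
            CharBuffer repaired = StandardCharsets.UTF_8.newDecoder().decode(bytes);
            return Optional.of(repaired.toString());
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // "itâ€™s" written with escapes to avoid source-encoding issues.
        String garbled = "it\u00e2\u20ac\u2122s";
        System.out.println(tryRepair(garbled, "windows-1252").isPresent());       // true
        System.out.println(tryRepair("it\u2019s", "windows-1252").isPresent());   // false
        System.out.println(tryRepair("plain ascii", "windows-1252").isPresent()); // false
    }
}
```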

If you still have access to the raw input, you can check whether a byte array consists entirely of valid UTF-8 byte sequences with this:

public static boolean isValidUTF8( byte[] input ) {

    CharsetDecoder cs = Charset.forName("UTF-8").newDecoder();

    try {
        cs.decode(ByteBuffer.wrap(input));
        return true;
    }
    catch(CharacterCodingException e){
        return false;
    }       
}

You can also use the CharsetDecoder with streams. By default, it throws an exception as soon as it sees bytes that are invalid in the given encoding.
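
A minimal sketch of the stream variant, assuming the data arrives as an `InputStream`: wrapping it in an `InputStreamReader` constructed with a `CharsetDecoder` (rather than a charset name) keeps the decoder's reporting behaviour, so reading fails with a `CharacterCodingException` on the first malformed sequence. The class and method names (`StrictUtf8Stream`, `isValidUtf8`) are illustrative only.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8Stream {

    // Reads the whole stream and reports whether it decoded as UTF-8 without errors.
    static boolean isValidUtf8(InputStream in) {
        // REPORT is already the default for a new decoder; set here to make it explicit.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buffer = new char[8192];
            while (reader.read(buffer) != -1) {
                // Discard the decoded chars; only success or failure matters here.
            }
            return true;
        } catch (CharacterCodingException e) {
            // Malformed or unmappable bytes were found.
            return false;
        } catch (IOException e) {
            // A genuine I/O failure, unrelated to the encoding.
            throw new UncheckedIOException(e);
        }
    }
}
```

For a file, the same method can be called with something like `Files.newInputStream(path)`. Note that it tells you only that the bytes are well-formed UTF-8, not that the text they encode is what the author meant.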

This is by far the simplest solution I've found so far. Thanks! — Chepech, Oct 09 '13 at 21:14

score -5 · Answer 2 · answered Dec 15 '15 at 05:58

If you are using HTML5, just add <meta charset="UTF-8"> inside the <head>.

For HTML4, use <meta http-equiv="Content-type" content="text/html;charset=UTF-8">.

Tabish
