17

I have dirty data. Sometimes it contains characters like this. I use this data to make queries like

WHERE a.address IN ('mydatahere')

For this character I get

org.hibernate.exception.GenericJDBCException: Illegal mix of collations (utf8_bin,IMPLICIT), (utf8mb4_general_ci,COERCIBLE), (utf8mb4_general_ci,COERCIBLE) for operation ' IN '

How can I filter out characters like this? I use Java.

Thanks.

Galimov Albert
  • 7,269
  • 1
  • 24
  • 50
Stepan Yakovenko
  • 8,670
  • 28
  • 113
  • 206

6 Answers6

10

When I had problem like this, I used Perl script to ensure that data is converted to valid UTF-8 by using code like this:

use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
    print Encode::decode('UTF-8', $_);
}

This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with (U+FFFD, Unicode replacement character).

If you run this script on good UTF-8 input, output should be identical to input.

If you have data in database, it makes sense to use DBI to scan your table(s) and scrub all data using this approach to make sure that everything is valid UTF-8.

This is Perl one-liner version of this same script:

perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt

EDIT: Added Java-only solution.

This is an example how to do this in Java:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class UtfFix {
    public static void main(String[] args) throws InterruptedException, CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // corrupted UTF-8, was 'и'
            (byte) 0xD0, (byte) 0xB2, // 'в'
            (byte) 0xD0, (byte) 0xB5, // 'е'
            (byte) 0xD1, (byte) 0x82  // 'т'
        });
        CharBuffer parsed = decoder.decode(bb);
        System.out.println(parsed);
        // this prints: Пр?вет
    }
}
Diego Sevilla
  • 28,636
  • 4
  • 59
  • 87
mvp
  • 111,019
  • 13
  • 122
  • 148
  • Sorry, i need java solution. – Stepan Yakovenko Dec 01 '12 at 07:40
  • You'd want a CharsetEncoder with the same replacing behaviour first, to get the bytes for the input string, before then decoding it back to String. – bobince Dec 01 '12 at 11:06
  • you will need to replace ByteBuffer in this example with actual byte stream from your source - would that be text file or SQL column. gist of this example is how to make sure that your UTF-8 stream is definitely compliant UTF-8. It is up to you how do you exactly get this data into your `ByteBuffer` or `ByteArrayInputStream` – mvp Dec 01 '12 at 22:22
  • Note that if you need a Reader, rather than handling buffers directly, as of Java 7, [InputStreamReader](http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.nio.charset.CharsetDecoder)) can accept a pre-configured `CharsetDecoder`, as in [this example](http://stackoverflow.com/a/7281034/1303595). – MandisaW Sep 14 '16 at 19:38
  • Why does the java solution not use ? instead of the unicode replacement character? Worse, all of the docs indicate that the default replacement character is \uFFFD... perhaps it's a bug? – Mike McCoy Oct 05 '16 at 00:05
  • I edited the post to add a backslash in the `$_` variable in the perl oneliner. – Diego Sevilla Oct 13 '16 at 05:57
6

You can encode and then decode it to/from UTF-8:

String label = "look into my eyes 〠.〠";

Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();

System.out.println(label);

output:

look into my eyes ?.?

edit: I think this might only work on Java 6.

Ring
  • 2,249
  • 5
  • 27
  • 40
  • 1
    Well, you just removed perfectly valid Unicode characters "〠.〠" using this method. Question was how to replace only _invalid_ characters, not all of them. – mvp Oct 05 '15 at 18:56
  • Just tested, it gives `look into my eyes 〠.〠` output, and that's right, because that code should not cause loss of Unicode characters, unless you are using ASCII for Java source files. – Dmitry Ratty Jul 04 '17 at 12:16
3

You can filter surrogate characters with this regex:

String str  = ""; //U+20000, represented by 2 chars in java (UTF-16 surrogate pair)
str = str.replaceAll( "([\\ud800-\\udbff\\udc00-\\udfff])", "");
System.out.println(str.length()); //0
Esailija
  • 138,174
  • 23
  • 272
  • 326
  • (Note this will remove all use of surrogates, not just invalid sequences.) – bobince Dec 01 '12 at 10:52
  • @bobince yeah, I am thinking of MySQL which afaik has problems with characters beyond BMP. http://bugs.mysql.com/bug.php?id=25666 – Esailija Dec 01 '12 at 11:01
  • Ah, yes - you have to use utfmb4 to store the astral planes and that's relatively new. (Well, unless you just dump everything into a binary string or use a misleading encoding, so you don't get the benefit of Unicode-aware string collation.) – bobince Dec 01 '12 at 11:04
  • 2
    Doesn't dc00 come right after dbff? Why can't this just be "([\\ud800-\\udfff])" – MattyB Dec 06 '16 at 23:08
2

Once you convert the byte array to String on the java machine, you'll get (by default on most machines) UTF-16 encoded string. The proper solution to get rid of non UTF-8 characters is with the following code:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
    System.out.println(values[i].replaceAll(
                    //"[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx - commented because of capitol letters
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
            , ""));
}

or if you want to validate if some string contains non utf8 characters you would use Pattern.matches like:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
    System.out.println(Pattern.matches(
                    ".*(" +
                    //"[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx - commented because of capitol letters
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                    + ").*"
            , values[i]));
}

For making a whole web app be UTF8 compatible read here:
How to get UTF-8 working in Java webapps
More on Byte Encodings and Strings.
You can check your pattern here.
The same in PHP here.

Community
  • 1
  • 1
despot
  • 7,167
  • 9
  • 44
  • 63
-1

May be this will help someone as it helped me.

public static String removeBadChars(String s) {
  if (s == null) return null;
  StringBuilder sb = new StringBuilder();
  for(int i=0;i<s.length();i++){ 
    if (Character.isHighSurrogate(s.charAt(i))) continue;
    sb.append(s.charAt(i));
  }
  return sb.toString();
}
dda
  • 6,030
  • 2
  • 25
  • 34
Stepan Yakovenko
  • 8,670
  • 28
  • 113
  • 206
  • You want to remove those that are `isLowSurrogate()` as well. If your string has a valid >BMP character, this will leave the string corrupted I think. – Esailija Dec 01 '12 at 11:09
-1

In PHP - I approach this by only allowing printable data. This really helps in cleaning the data for DB.
It's pre-processing though and sometimes you don't have that luxury.

$str = preg_replace('/[[:^print:]]/', '', $str);
Chris Lambrou
  • 356
  • 2
  • 7