How to remove bad characters that are not suitable for utf8 encoding in MySQL?

Question

I have dirty data. Sometimes it contains characters like this. I use this data to make queries like

WHERE a.address IN ('mydatahere')

For this character I get

org.hibernate.exception.GenericJDBCException: Illegal mix of collations (utf8_bin,IMPLICIT), (utf8mb4_general_ci,COERCIBLE), (utf8mb4_general_ci,COERCIBLE) for operation ' IN '

How can I filter out characters like this? I use Java.

Thanks.

score 10 · Answer 1 · edited Oct 13 '16 at 05:56

When I had problem like this, I used Perl script to ensure that data is converted to valid UTF-8 by using code like this:

use Encode;
binmode(STDOUT, ":utf8");
while (<>) {
    print Encode::decode('UTF-8', $_);
}

This script takes (possibly corrupted) UTF-8 on stdin and re-prints valid UTF-8 to stdout. Invalid characters are replaced with � (U+FFFD, Unicode replacement character).

If you run this script on good UTF-8 input, output should be identical to input.

If you have data in database, it makes sense to use DBI to scan your table(s) and scrub all data using this approach to make sure that everything is valid UTF-8.

This is Perl one-liner version of this same script:

perl -MEncode -e "binmode STDOUT,':utf8';while(<>){print Encode::decode 'UTF-8',\$_}" < bad.txt > good.txt

EDIT: Added Java-only solution.

This is an example how to do this in Java:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class UtfFix {
    public static void main(String[] args) throws InterruptedException, CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
        decoder.onMalformedInput(CodingErrorAction.REPLACE);
        decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
        ByteBuffer bb = ByteBuffer.wrap(new byte[] {
            (byte) 0xD0, (byte) 0x9F, // 'П'
            (byte) 0xD1, (byte) 0x80, // 'р'
            (byte) 0xD0,              // corrupted UTF-8, was 'и'
            (byte) 0xD0, (byte) 0xB2, // 'в'
            (byte) 0xD0, (byte) 0xB5, // 'е'
            (byte) 0xD1, (byte) 0x82  // 'т'
        });
        CharBuffer parsed = decoder.decode(bb);
        System.out.println(parsed);
        // this prints: Пр?вет
    }
}

You'd want a CharsetEncoder with the same replacing behaviour first, to get the bytes for the input string, before then decoding it back to String. — bobince, Dec 01 '12 at 11:06
you will need to replace ByteBuffer in this example with actual byte stream from your source - would that be text file or SQL column. gist of this example is how to make sure that your UTF-8 stream is definitely compliant UTF-8. It is up to you how do you exactly get this data into your `ByteBuffer` or `ByteArrayInputStream` — mvp, Dec 01 '12 at 22:22
Note that if you need a Reader, rather than handling buffers directly, as of Java 7, [InputStreamReader](http://docs.oracle.com/javase/7/docs/api/java/io/InputStreamReader.html#InputStreamReader(java.io.InputStream,%20java.nio.charset.CharsetDecoder)) can accept a pre-configured `CharsetDecoder`, as in [this example](http://stackoverflow.com/a/7281034/1303595). — MandisaW, Sep 14 '16 at 19:38
Why does the java solution not use ? instead of the unicode replacement character? Worse, all of the docs indicate that the default replacement character is \uFFFD... perhaps it's a bug? — Mike McCoy, Oct 05 '16 at 00:05
I edited the post to add a backslash in the `$_` variable in the perl oneliner. — Diego Sevilla, Oct 13 '16 at 05:57

Ring · Answer 2 · 2019-11-18T21:47:10.473

6

You can encode and then decode it to/from UTF-8:

String label = "look into my eyes 〠.〠";

Charset charset = Charset.forName("UTF-8");
label = charset.decode(charset.encode(label)).toString();

System.out.println(label);

output:

look into my eyes ?.?

edit: I think this might only work on Java 6.

edited Nov 18 '19 at 21:47

answered Jul 17 '14 at 04:53

Ring

2,249
5
27
40

1

Well, you just removed perfectly valid Unicode characters "〠.〠" using this method. Question was how to replace only _invalid_ characters, not all of them. – mvp Oct 05 '15 at 18:56
Just tested, it gives `look into my eyes 〠.〠` output, and that's right, because that code should not cause loss of Unicode characters, unless you are using ASCII for Java source files. – Dmitry Ratty Jul 04 '17 at 12:16

score 3 · Answer 3 · answered Dec 01 '12 at 08:12

3

You can filter surrogate characters with this regex:

String str  = ""; //U+20000, represented by 2 chars in java (UTF-16 surrogate pair)
str = str.replaceAll( "([\\ud800-\\udbff\\udc00-\\udfff])", "");
System.out.println(str.length()); //0

answered Dec 01 '12 at 08:12

Esailija

138,174
23
272
326

(Note this will remove all use of surrogates, not just invalid sequences.) – bobince Dec 01 '12 at 10:52
@bobince yeah, I am thinking of MySQL which afaik has problems with characters beyond BMP. http://bugs.mysql.com/bug.php?id=25666 – Esailija Dec 01 '12 at 11:01
Ah, yes - you have to use utfmb4 to store the astral planes and that's relatively new. (Well, unless you just dump everything into a binary string or use a misleading encoding, so you don't get the benefit of Unicode-aware string collation.) – bobince Dec 01 '12 at 11:04
2

Doesn't dc00 come right after dbff? Why can't this just be "([\\ud800-\\udfff])" – MattyB Dec 06 '16 at 23:08

score 2 · Answer 4 · edited May 23 '17 at 11:54

Once you convert the byte array to String on the java machine, you'll get (by default on most machines) UTF-16 encoded string. The proper solution to get rid of non UTF-8 characters is with the following code:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
    System.out.println(values[i].replaceAll(
                    //"[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx - commented because of capitol letters
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
            , ""));
}

or if you want to validate if some string contains non utf8 characters you would use Pattern.matches like:

String[] values = {"\\xF0\\x9F\\x98\\x95", "\\xF0\\x9F\\x91\\x8C", "/*", "look into my eyes 〠.〠", "fkdjsf ksdjfslk", "\\xF0\\x80\\x80\\x80", "aa \\xF0\\x9F\\x98\\x95 aa", "Ok"};
for (int i = 0; i < values.length; i++) {
    System.out.println(Pattern.matches(
                    ".*(" +
                    //"[\\\\x00-\\\\x7F]|" + //single-byte sequences   0xxxxxxx - commented because of capitol letters
                    "[\\\\xC0-\\\\xDF][\\\\x80-\\\\xBF]|" + //double-byte sequences   110xxxxx 10xxxxxx
                    "[\\\\xE0-\\\\xEF][\\\\x80-\\\\xBF]{2}|" + //triple-byte sequences   1110xxxx 10xxxxxx * 2
                    "[\\\\xF0-\\\\xF7][\\\\x80-\\\\xBF]{3}" //quadruple-byte sequence 11110xxx 10xxxxxx * 3
                    + ").*"
            , values[i]));
}

For making a whole web app be UTF8 compatible read here:
How to get UTF-8 working in Java webapps
More on Byte Encodings and Strings.
You can check your pattern here.
The same in PHP here.

Can you explain the "commented out because of capital letters"? — aliteralmind, Jun 29 '15 at 13:15

score -1 · Accepted Answer · edited Dec 01 '12 at 14:08

-1

May be this will help someone as it helped me.

public static String removeBadChars(String s) {
  if (s == null) return null;
  StringBuilder sb = new StringBuilder();
  for(int i=0;i<s.length();i++){ 
    if (Character.isHighSurrogate(s.charAt(i))) continue;
    sb.append(s.charAt(i));
  }
  return sb.toString();
}

edited Dec 01 '12 at 14:08

dda

6,030
2
25
34

answered Dec 01 '12 at 09:11

Stepan Yakovenko

8,670
28
113
206

You want to remove those that are `isLowSurrogate()` as well. If your string has a valid >BMP character, this will leave the string corrupted I think. – Esailija Dec 01 '12 at 11:09

score -1 · Answer 6 · answered Dec 16 '19 at 15:29

-1

In PHP - I approach this by only allowing printable data. This really helps in cleaning the data for DB.
It's pre-processing though and sometimes you don't have that luxury.

$str = preg_replace('/[[:^print:]]/', '', $str);

answered Dec 16 '19 at 15:29

Chris Lambrou

356
2
7

How to remove bad characters that are not suitable for utf8 encoding in MySQL?

6 Answers6

Linked