
I have a Java program that fetches rows from a SQL Server DB and inserts the same rows into an Informix DB. The Informix DB only supports the 8859-1 character set. Sometimes the users insert a row into the SQL Server DB by copying and pasting from Word or Excel, and that causes some characters to end up as Unicode characters that 8859-1 cannot represent (some of them 3 bytes in size when encoded as UTF-8).
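For example, a typical culprit when pasting from Word is the curly apostrophe (U+2019) that Word substitutes for a plain ', and it takes three bytes in UTF-8:

import java.nio.charset.StandardCharsets;

String pasted = "It\u2019s";  // the apostrophe became U+2019 after pasting from Word
byte[] utf8 = pasted.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length);  // prints 6: 'I', 't', 's' are one byte each, U+2019 is three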

How can I write a filter function that replaces those Unicode characters with, for example, a '?' or something else?

/Jimmy

J. Plexor
  • Maybe related: https://stackoverflow.com/questions/229015/encoding-conversion-in-java#229023 – Gabriel Molina Aug 22 '17 at 17:44
  • 8859-1 has 256 codepoints encoded with a value 0 to 255, so any sequence of byte values is valid. How would you tell that a byte sequence should be interpreted as UTF-8 instead of 8859-1? Where exactly are the users pasting _their_ text such that your system is mishandling it? – Tom Blodget Aug 22 '17 at 19:33

1 Answer


You could replace all non-ASCII characters with ?:

StringBuilder buf = new StringBuilder();
for (char ch : originalString.toCharArray()) {
    if (ch > 127) {
        // outside the 7-bit ASCII range: substitute a placeholder
        buf.append('?');
    } else {
        buf.append(ch);
    }
}
return buf.toString();

Another way is to use a regular expression:

originalString.replaceAll("\\P{ASCII}", "?")

It replaces every character that is not an ASCII character with ?. Note that replaceAll returns the result as a new string rather than modifying originalString.
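Since your target is 8859-1, which covers code points 0 to 255 and not just ASCII, you may want to keep accented letters and other Latin-1 characters instead of replacing them too. A sketch of the same loop that asks a CharsetEncoder whether each character is encodable:

import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
StringBuilder buf = new StringBuilder();
for (char ch : originalString.toCharArray()) {
    // canEncode reports whether this character survives conversion to ISO-8859-1
    buf.append(latin1.canEncode(ch) ? ch : '?');
}
return buf.toString();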

Roman Puchkovskiy