
I have a Java program that fetches rows from a SQL Server DB and inserts the same rows into an Informix DB. The Informix DB only supports the 8859-1 character set. Sometimes the users insert a row into the SQL Server DB by copying and pasting from Word or Excel, and that causes some characters to end up as Unicode characters that 8859-1 cannot represent (some of them 3 bytes in size when encoded as UTF-8).
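For example, a typical culprit when pasting from Word is the curly apostrophe (U+2019) that Word substitutes for a plain ', and it takes three bytes in UTF-8:

import java.nio.charset.StandardCharsets;

String pasted = "It\u2019s";  // the apostrophe became U+2019 after pasting from Word
byte[] utf8 = pasted.getBytes(StandardCharsets.UTF_8);
System.out.println(utf8.length);  // prints 6: 'I', 't', 's' are one byte each, U+2019 is three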

How can I write a filter function that replaces those Unicode characters with, for example, a '?' or something else?

/Jimmy

J. Plexor
  • Maybe related: https://stackoverflow.com/questions/229015/encoding-conversion-in-java#229023 – Gabriel Molina Aug 22 '17 at 17:44
  • 8859-1 has 256 codepoints encoded with a value 0 to 255, so any sequence of byte values is valid. How would you tell that a byte sequence should be interpreted as UTF-8 instead of 8859-1? Where exactly are the users pasting _their_ text such that your system is mishandling it? – Tom Blodget Aug 22 '17 at 19:33

1 Answer


You could replace all non-ASCII characters with ?:

StringBuilder buf = new StringBuilder();
for (char ch : originalString.toCharArray()) {
    if (ch > 127) {
        // outside the 7-bit ASCII range: substitute a placeholder
        buf.append('?');
    } else {
        buf.append(ch);
    }
}
return buf.toString();

Another way is to use a regular expression:

originalString.replaceAll("\\P{ASCII}", "?")

It replaces every character that is not an ASCII character with ?. Note that replaceAll returns the result as a new string rather than modifying originalString.
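Since your target is 8859-1, which covers code points 0 to 255 and not just ASCII, you may want to keep accented letters and other Latin-1 characters instead of replacing them too. A sketch of the same loop that asks a CharsetEncoder whether each character is encodable:

import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
StringBuilder buf = new StringBuilder();
for (char ch : originalString.toCharArray()) {
    // canEncode reports whether this character survives conversion to ISO-8859-1
    buf.append(latin1.canEncode(ch) ? ch : '?');
}
return buf.toString();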

Roman Puchkovskiy