Recently, in one case, I found a string that contains a control character. We save the string to the DB and then try to create an XML file and an HTML file from it. It is saved properly in the DB, and it is displayed as follows in different places:
1) When querying the DB, it shows the name as .
2) When I copy it into Notepad++ (UTF-8 encoding), it is shown as .
3) In the Eclipse IDE, debug mode shows it the same as the DB.
4) In the table records on the HTML page (Apache Tomcat) and in System.out output on the console, it is shown simply as , which I think is the preferred and intended output.
I am able to create the XML file, although with a junk character in it, but when I try to create the HTML using javax.xml.transform.TransformerFactory with UTF-8 encoding, the call
transformer.transform(source, result);
throws the exception "Illegal HTML character - decimal 129 ".
My understanding is that the string contains a control character (decimal 129, U+0081) that the HTML output does not accept, and so the transformer throws this exception.
I found a reference for it here:
https://www.fileformat.info/info/unicode/char/0081/index.htm
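For reference, here is a small check of what decimal 129 is in Java terms. It is only a sketch, and the class and method names (CharInspector, utf8Hex) are made up for illustration. It suggests that U+0081 is an ISO control character and that it can be encoded in UTF-8 as two bytes, so the complaint may be about the HTML output rather than about UTF-8 itself:

```java
import java.nio.charset.StandardCharsets;

class CharInspector {
    // Returns the UTF-8 bytes of a single code point as space-separated hex.
    static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        int cp = 0x81; // decimal 129, the value named in the exception
        System.out.println(Character.isISOControl(cp)); // prints true
        System.out.println(utf8Hex(cp));                // prints C2 81
    }
}
```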
To resolve it I tried many things. The one that comes closest to the intended result is to process each string manually before giving it to the transformer, converting it as below:
String str = new String(nodeValue.getBytes(StandardCharsets.US_ASCII), StandardCharsets.UTF_8);
str = str.replaceAll("[^\\p{ASCII}]", "");
This solves the issue up to a point, but I don't think processing the whole content is a good way to remove one control character from a String, and it also converts the name to
which is not what I want; I actually want it without any change.
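A narrower alternative I could imagine is to drop only the Unicode control characters (general category Cc) while keeping tab, line feed and carriage return, so that other non-ASCII characters are left alone. This is only a sketch: the class ControlCharFilter, the helper stripControlChars and the sample value are hypothetical, and it still removes the character rather than preserving it:

```java
import java.util.regex.Pattern;

class ControlCharFilter {
    // Matches Unicode control characters (general category Cc, which covers
    // U+0000-U+001F and U+007F-U+009F) except tab, line feed and carriage return.
    private static final Pattern CONTROL_CHARS =
            Pattern.compile("[\\p{Cc}&&[^\\t\\n\\r]]");

    // Hypothetical helper: removes the matched control characters and
    // leaves every other character, including non-ASCII letters, untouched.
    static String stripControlChars(String value) {
        return CONTROL_CHARS.matcher(value).replaceAll("");
    }

    public static void main(String[] args) {
        String nodeValue = "Jos\u00e9\u0081 Smith"; // made-up sample value
        System.out.println(stripControlChars(nodeValue));
    }
}
```

Unlike the US_ASCII round trip above, this does not touch characters outside the control range, but whether silently dropping the character is acceptable depends on the data.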
Is there any standard way to do this, so that we get the correct output in the generated HTML?
How do System.out and the HTML page served by Apache Tomcat show it correctly? Do they handle it explicitly?