I recently came across a string that contains a control character. We save it to the DB and then try to generate an XML file and an HTML file from it. It is saved properly in the DB and is displayed as follows in different places:
1) When querying the DB, the name is shown as [image].
2) When I copy it into Notepad++ (UTF-8 encoding), it is shown as [image].
3) In the Eclipse IDE, the debugger shows it the same way as the DB.
4) In table records on the HTML page (Apache Tomcat) and as sysout output in the console, it is shown simply as [image], which I think is the preferable and intended output.

I am able to create the XML file, with a junk character in it, but when I try to create the HTML using the javax TransformerFactory with UTF-8 encoding, transformer.transform(source, result); throws the exception "Illegal HTML character - decimal 129".
I understand that there is a control character in the string which is not supported by UTF-8, and that this is why the parser throws the exception.
I found a reference for it here: https://www.fileformat.info/info/unicode/char/0081/index.htm

To resolve it I tried many things. The one that comes closest to the intended result is to process the strings manually before handing them to the parser, converting each one to a UTF-8 string as below:
    String str = new String(nodeValue.getBytes(StandardCharsets.US_ASCII), StandardCharsets.UTF_8);
    str = str.replaceAll("[^\\p{ASCII}]", "");

This solves the issue up to a point, but I don't think processing the whole content is a good way to remove one control character from a String. It also converts the name [image] to [image], which is not acceptable; I want the name to come through without any change.
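A minimal sketch of why the conversion above alters the name, assuming the name contains non-ASCII letters. The sample value "José" followed by U+0081 and "1" is hypothetical, not taken from the question. String.getBytes(StandardCharsets.US_ASCII) replaces every character that ASCII cannot encode with '?', so by the time replaceAll runs there is nothing non-ASCII left to remove. The second method, dropU0081, is an illustrative alternative that removes only the one offending character.

```java
import java.nio.charset.StandardCharsets;

final class AsciiRoundTrip {
    // The conversion from the question, with the regex backslash escaped for Java.
    static String viaAscii(String nodeValue) {
        // Unmappable characters (anything above U+007F) are encoded as '?'.
        String str = new String(nodeValue.getBytes(StandardCharsets.US_ASCII), StandardCharsets.UTF_8);
        // Every remaining character is already ASCII, so this removes nothing.
        return str.replaceAll("[^\\p{ASCII}]", "");
    }

    // Hypothetical alternative: remove only U+0081 and leave everything else alone.
    static String dropU0081(String nodeValue) {
        return nodeValue.replace("\u0081", "");
    }

    public static void main(String[] args) {
        // Hypothetical sample: "José", then U+0081, then "1".
        String name = "Jos\u00E9\u00811";
        System.out.println(viaAscii(name));   // accented letter and control character both become '?'
        System.out.println(dropU0081(name));  // accented letter is preserved
    }
}
```

The targeted removal keeps legitimate non-ASCII letters intact, which the ASCII round trip cannot do.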

Is there any standard way to do this, so that we can get the correct output in parsed HTML?

How do sysout and the Apache Tomcat HTML page show it correctly? Do they handle it explicitly?

  • HTML and XML do not allow every Unicode codepoint. (But UTF-8 does.) Where did this text come from, what are its bytes and character encoding and what is the intent of the codepoints you are identifying as control characters? – Tom Blodget Sep 05 '18 at 02:38
  • Thanks @Tom Blodget. This text was entered by a client in the GUI; I am not sure how. It seems to be a Latin character. I tried printing it in a normal HTML file, and it just skips the character and shows the rest of the string. The error comes only from the transformation, when we try to process it with the javax API call transformer.transform(source, result);. Below are the bytes of the string; for the control character they are [-62,-127] in utf_8 [U+003E, 7F]. [116, 101, 115, 116, -62, -127, 49, 38, 35, 49, 50, 57, 59] –  Sep 05 '18 at 06:39
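As a hedged illustration of the byte list in the comment above, the sketch below decodes those thirteen bytes as UTF-8 and prints each code point. The class and method names are made up for this example. Under that decoding, the pair [-62, -127] (0xC2 0x81) is the single code point U+0081, and the last six bytes are the literal text `&#129;`, a numeric character reference to the same code point (129 decimal is 0x81).

```java
import java.nio.charset.StandardCharsets;

final class DecodeBytes {
    // The byte values quoted in the comment, decoded as UTF-8.
    static String sampleFromComment() {
        byte[] bytes = {116, 101, 115, 116, -62, -127, 49, 38, 35, 49, 50, 57, 59};
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Formats every code point of s as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(codePoints(sampleFromComment()));
        // U+0074 U+0065 U+0073 U+0074 U+0081 U+0031 U+0026 U+0023 U+0031 U+0032 U+0039 U+003B
    }
}
```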

1 Answer

U+0081 is illegal in HTML, no matter how you express it as a character in the document.

It looks like someone is testing you. Either the GUI shouldn't have allowed it or you have to live with the inability to show it in HTML. If you simply need to show it, you could convert it to an image. Unfortunately, there is not a corresponding Control Picture for [HOP].

Tom Blodget
  • Thanks. Then I think I should use something like replaceAll("\\p{Cc}", "") as an exception-handling mechanism, as suggested here: https://stackoverflow.com/questions/3438854/replace-unicode-control-characters/3439206#3439206 –  Sep 06 '18 at 09:47
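A short sketch of the replaceAll approach mentioned in the comment above. The class and method names are hypothetical. Note that \p{Cc} covers all control characters, including tab, line feed, and carriage return, so the second method is an assumed variant for cases where that whitespace should survive; it uses Java's character-class intersection (&&) to exclude those three.

```java
final class ControlCharFilter {
    // Removes every character in Unicode general category Cc (C0 and C1 controls, plus DEL).
    static String stripAllControls(String s) {
        return s.replaceAll("\\p{Cc}", "");
    }

    // Assumed variant: removes control characters but keeps tab, LF, and CR.
    static String stripControlsKeepWhitespace(String s) {
        return s.replaceAll("[\\p{Cc}&&[^\\t\\n\\r]]", "");
    }

    public static void main(String[] args) {
        String value = "test\u00811\n";
        System.out.println(stripAllControls(value).equals("test1"));               // true
        System.out.println(stripControlsKeepWhitespace(value).equals("test1\n"));  // true
    }
}
```

Applying the filter to each node value before calling transformer.transform(source, result) would be one place to use it; whether silently dropping the character is acceptable depends on the application.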