Charset issue with XSS API in CQ5: Ã being displayed as Ã�

Question

I'm using com.adobe.granite.xss for encoding strings in JSP. It seems to work with most characters, except for Ã. Ã is displayed as Ã�.

It happens when using the xssAPI.encodeForHTML() method. I have tried <cq:text> with escapeXml="true" and it behaves the same way.

The characters are stored properly in the repository, and I have also set content="text/html; charset=utf-8" in the JSP.

Is there a way to encode or filter the input for XSS without the charset breaking in such situations?

I have tried it with different non-Latin characters, and most of them are not affected by the XSS API.


Character Â appears to have the same problem. Since e.g. Ã = U+00C3 which is 0xC3 0x83 in UTF-8, it seems that this part of the data is UTF-8 encoded data that has got its bytes misinterpreted as ISO-8859-1 data (and “�” is perhaps an indication of the fact that 0x83 is assigned to a control code in ISO-8859-1). — Jukka K. Korpela, Nov 14 '14 at 08:35
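The byte-level explanation in the comment above can be reproduced in isolation. The following is a minimal sketch, not the code path CQ5 actually takes: the class name MojibakeDemo and the helper misdecodeUtf8AsLatin1 are hypothetical names chosen for illustration. It encodes a string as UTF-8 and then deliberately decodes those bytes as ISO-8859-1, which is the misinterpretation the comment suspects.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of CQ5 or the XSS API.
class MojibakeDemo {

    // Encodes s as UTF-8, then (wrongly, on purpose) decodes the bytes as ISO-8859-1.
    static String misdecodeUtf8AsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Renders each UTF-16 code unit as U+XXXX so control characters stay visible.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+00C3 is the two bytes 0xC3 0x83 in UTF-8.
        System.out.println(describe(misdecodeUtf8AsLatin1("\u00C3"))); // prints "U+00C3 U+0083"
    }
}
```

If the page shows Ã followed by an unprintable character, that matches this pattern: the second character is the control code U+0083, which browsers typically render as a replacement glyph.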

score 2 · Accepted Answer · edited May 23 '17 at 12:28

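It looks like an issue in owasp-esapi-java, which CQ's XSSAPI uses, because it iterates through the string using the charAt() method, which returns UTF-16 code units rather than code points. Note that Ã (U+00C3) is itself inside the BMP, so charAt() alone would not split it; for characters outside the BMP, though, the right way of iterating would be: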

final int length = s.length();
for (int offset = 0; offset < length; ) {
   final int codepoint = s.codePointAt(offset);

   // do something with the codepoint

   offset += Character.charCount(codepoint);
}

(from How can I iterate through the unicode codepoints of a Java String?)
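To check whether a particular character is affected by charAt()-based iteration at all, the loop above can be wrapped in a small test harness. This is only a sketch; the class name CodePointCheck and its two helpers are hypothetical and are not part of ESAPI or the XSS API. It compares the number of UTF-16 code units with the number of code points: the two differ only when the string contains supplementary characters (surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class for illustration only.
class CodePointCheck {

    // Collects the code points of s using the codePointAt()/charCount() loop.
    static List<Integer> codePoints(String s) {
        List<Integer> out = new ArrayList<>();
        final int length = s.length();
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);
            out.add(codepoint);
            offset += Character.charCount(codepoint);
        }
        return out;
    }

    // True when s contains at least one surrogate pair, i.e. when
    // charAt() would see two code units for a single character.
    static boolean needsCodePointIteration(String s) {
        return s.length() != s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String aTilde = "\u00C3";                                // Ã, inside the BMP
        String emoji = new String(Character.toChars(0x1F600));   // outside the BMP

        System.out.println(needsCodePointIteration(aTilde));     // prints "false"
        System.out.println(needsCodePointIteration(emoji));      // prints "true"
    }
}
```

For Ã the check returns false, so both iteration styles see the same single value, U+00C3; the difference only shows up for supplementary characters.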

So I think that it's an issue in this library.

Try using xssAPI.filterHTML() instead; it may solve your issue.
