Why does this character encoding issue only occur on select systems?

Question

We are using CKEditor, a JavaScript WYSIWYG text editor. The editor has a source view that shows the HTML markup for what the user has entered. Sometimes the editor inserts non-breaking spaces (`&nbsp;`) into this source view, which is fine.

Everything seemed to work correctly on the dev machines, so we deployed to our production servers. At that point we started seeing a stray Â character being inserted into the text. After some reading I saw that this had been reported in several tickets on the CKEditor bug tracker. I was able to resolve the issue by setting the `charset` attribute on the script tag for ckeditor.js to UTF-8.

My question is this: why did the script tag need the `charset` attribute set in the first place, and why only on certain systems?

The last comment on this SO question mentions that the UTF-8 byte sequence for a non-breaking space, when read as latin1 (which is ISO-8859-1, right?), is the Â character followed by a non-breaking space. This could definitely be a clue, because one more Â character is inserted every time the user switches to source view. It is as if CKEditor is trying to inject a non-breaking space, but that gets turned into `Â&nbsp;`, then `ÂÂ&nbsp;`, and so on. The content-type on all systems (viewed from the Chrome debugger) is `text/html;charset=ISO-8859-1`, and I am not sure why. The `-Dfile.encoding` option in all Tomcat configs is set to utf-8. The meta tag is also `<meta charset="utf-8">`.
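
The byte-level claim can be checked with a small Java sketch. The class and method names here are made up for illustration, and `switchToSourceView` is only a hypothetical model of the symptom (each existing non-breaking space being replaced by a misread literal), not CKEditor's actual code.

```java
import java.nio.charset.StandardCharsets;

public class NbspMojibake {
    static final String NBSP = "\u00A0"; // non-breaking space

    // Encode the text as UTF-8, then decode the same bytes as ISO-8859-1,
    // which is what happens when a reader assumes the wrong charset.
    static String misdecode(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Hypothetical model of one switch to source view: every NBSP in the
    // content is replaced by an NBSP literal that was misread as "Â" + NBSP.
    static String switchToSourceView(String content) {
        return content.replace(NBSP, misdecode(NBSP));
    }

    // Render each char as U+XXXX so invisible characters can be seen.
    static String codePoints(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            out.append(String.format("U+%04X ", (int) c));
        }
        return out.toString().trim();
    }

    public static void main(String[] args) {
        // UTF-8 bytes C2 A0 read as ISO-8859-1 give U+00C2 (Â) and U+00A0.
        System.out.println(codePoints(misdecode(NBSP)));

        String content = "a" + NBSP + "b";
        for (int i = 1; i <= 3; i++) {
            content = switchToSourceView(content);
            System.out.println(i + ": " + codePoints(content));
        }
    }
}
```

Under this model, one extra U+00C2 appears in front of the non-breaking space on every pass, which matches the growing run of Â characters described above.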

`FILE.encoding` should be `file.encoding`; System properties are case sensitive! — Aaron Digulla, Oct 17 '13 at 14:54
@AaronDigulla Sorry, it is lower case on the servers, I just typed it incorrectly. Fixed now, thanks! — theblang, Oct 17 '13 at 15:10

score 1 · Answer 1 · edited May 23 '17 at 11:44

Fire up your development tools in the Web browser. When a form is rendered / submitted, stop and look at the request and response headers that are sent back and forth. Make sure you see UTF-8 everywhere. If it's missing, then one side will assume "default encoding" - whatever that might be.

Also make sure you have set the charset on the forms because they don't automatically inherit the one from the page.

EDIT This page explains in detail how you can set the charset when using Tomcat plus the necessary code for your servlets.
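
To illustrate the "default encoding" point, here is a minimal Java sketch of how a receiver might pick a charset from a `Content-Type` header. The class name, the `charsetOf` helper, and the choice of ISO-8859-1 as the fallback are assumptions for this example; real browsers, proxies, and containers each apply their own defaults.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class ContentTypeCharset {
    // Assumed fallback for this sketch; actual defaults vary by client.
    static final Charset FALLBACK = StandardCharsets.ISO_8859_1;

    // Return the charset named in the header value, or the fallback
    // when the header is absent, has no charset, or names an unknown one.
    static Charset charsetOf(String contentType) {
        if (contentType == null) {
            return FALLBACK;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length())
                                   .replace("\"", "")
                                   .trim();
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    return FALLBACK; // illegal or unsupported charset name
                }
            }
        }
        return FALLBACK;
    }

    public static void main(String[] args) {
        // The same UTF-8 bytes, decoded under two different headers.
        byte[] body = "a\u00A0b".getBytes(StandardCharsets.UTF_8);

        String withCharset = new String(body, charsetOf("text/html;charset=UTF-8"));
        String withoutCharset = new String(body, charsetOf("text/html"));

        System.out.println(withCharset.length());    // 3 chars: a, NBSP, b
        System.out.println(withoutCharset.length()); // 4 chars: a, Â, NBSP, b
    }
}
```

The bytes on the wire are identical in both cases; only the declared (or assumed) charset changes what the reader sees, which is why every hop needs to state UTF-8 explicitly.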

answered Oct 17 '13 at 15:26

Aaron Digulla


So I checked the `content-type` attribute on the GET request for the page and it is `text/html;charset=ISO-8859-1`. After that I don't see what else could matter since everything is client side at that point with the JavaScript CKEditor library. Maybe I'm wrong. – theblang Oct 17 '13 at 18:56
You're probably missing a `response.setCharacterEncoding("UTF-8")` in your Servlet's code. See my edit for details. – Aaron Digulla Oct 18 '13 at 09:03
Nice, doing that in my Spring controller changed the `content-type` to `UTF-8` instead of `ISO-8859-1`. The real mystery for me though is why would this encoding problem not occur on our Windows development machines or our Linux test server, but manifest on the Linux production servers. I thought surely it would be a config difference, but `Dfile.encoding` is all I can think to check. – theblang Oct 18 '13 at 13:26
There might be a proxy between the production server and the browser which inserts an encoding header if there isn't one. – Aaron Digulla Oct 18 '13 at 13:30
