
We have some data that is sourced in Italy and displayed from a server in Poland, and we are seeing some character substitution. Specifically, à (small letter A with grave) is being replaced by ŕ (small letter R with acute). We can see that à is byte value 0xE0 in the CP1252 Western European character set, and ŕ is the same byte value in the CP1250 Eastern European character set, so we know this is a character set issue.
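The byte-level claim above can be checked with a few lines of Java. This is only a sketch; the class and method names (`CodePageDemo`, `decodeByte`) are made up for illustration, and it assumes the JDK in use ships both `windows-1250` and `windows-1252`.

```java
import java.nio.charset.Charset;

public class CodePageDemo {
    // Decode a single byte value using the named charset.
    static String decodeByte(int value, String charsetName) {
        return new String(new byte[] {(byte) value}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        String western = decodeByte(0xE0, "windows-1252");
        String central = decodeByte(0xE0, "windows-1250");
        // Print code points rather than glyphs so the console encoding does not matter.
        System.out.printf("windows-1252: U+%04X%n", (int) western.charAt(0)); // U+00E0 (a with grave)
        System.out.printf("windows-1250: U+%04X%n", (int) central.charAt(0)); // U+0155 (r with acute)
    }
}
```

The same byte therefore becomes two different Unicode characters depending on which code page is used at the moment the bytes are turned into a Java `String`.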

The page is being served by a WebSphere app server using JSPs. I have an experimental page where I can reproduce the problem and sort of fix it, but not in an acceptable manner.

If I set this in my JSP:

response.setContentType("text/html;charset=windows-1250");

The problem is reproduced and the R with acute is displayed.

To sort of fix the problem, on the browser, I change the encoding to "Western European" in IE or "Western Windows-1252" in Chrome.

So this would naturally lead me to believe that if I set "windows-1252" in the content type, it would fix the problem, but it does not. When I do that, the character is then displayed as a question mark.

I have played with all kinds of combinations of response.setContentType, response.setCharacterEncoding, response.setLocale, <meta http-equiv>, <meta charset> and most everything results in the ? showing. Only setting 1250 on the content type and then changing the encoding on the browser itself seems to fix the problem.
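One possible explanation for the question mark, offered as an assumption rather than a confirmed diagnosis: if the text was already decoded as CP1250 somewhere upstream, the Java `String` in the JSP holds ŕ (U+0155) and not à. That character has no byte in windows-1252, so the encoder substitutes `?`. The sketch below (hypothetical class name `QuestionMarkDemo`) shows what each response charset would send for that one character.

```java
import java.nio.charset.Charset;

public class QuestionMarkDemo {
    // Encode a string with the named charset and return the bytes as hex.
    static String encodeToHex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String rAcute = "\u0155"; // assumed content of the already-decoded string
        System.out.println(encodeToHex(rAcute, "windows-1250")); // E0
        System.out.println(encodeToHex(rAcute, "windows-1252")); // 3F, i.e. '?'
        System.out.println(encodeToHex(rAcute, "UTF-8"));        // C5 95
    }
}
```

Under that assumption, sending windows-1250 and switching the browser to Western works only because byte 0xE0 happens to be read back as à; the header itself was never the faulty part.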

Any suggestions?

Thanks

Srikanth Venugopalan
Mike
  • Can you try `response.setContentType("text/html;charset=utf-8");`? – Elliott Frisch Jan 03 '14 at 17:16
  • The sent content type header should match the type that's actually sent to the client, i.e. the charset of the output stream (which might or might not default to UTF-8 or a locale specific charset). You should manually set the charset of the output stream to the encoding you want to use (e.g. UTF-8) and then also use that encoding as the content type header. – Njol Jan 03 '14 at 17:19
  • You seem to have focused on the application's output. Did you verify that the characters arrived correctly in the app? How do you read them? How do you then call setContentType in the JSP? With a scriptlet? Can you verify in the browser with Firebug or similar that the content-type header actually arrives in the browser? Are you sure that the wording at the top of http://docs.oracle.com/javaee/6/api/javax/servlet/ServletResponse.html about getWriter and commit is right? – Harald Jan 03 '14 at 17:50
  • If I use a charset of utf-8, the char displays as the r with acute. Interestingly, if I then change the browser encoding setting to western, the char changes to a capital A with some other accent mark. – Mike Jan 03 '14 at 18:12
  • As for focusing on the output, you are correct because the data being displayed is sourced from a working application in Italy and the data gets displayed properly there. Italy extracts the data using a Java program and sends it to a data center in Poland where another Java program inserts it into an MS SQL database. The WebSphere app server reads that db and displays the data to the user. – Mike Jan 03 '14 at 18:22
  • I have used the Chrome and IE developer tools to verify the contents of the content-type header and it is coming across properly: Content-Type:text/html;charset=windows-1252. I have verified that the set content type is the very first thing in the JSP. It is sending the content type I set in the header. It's just not doing what I would expect. – Mike Jan 03 '14 at 18:22
  • Ah, you send the data through several applications. Which one is the last where you verified that the encoding is correct? For example, between Italy and Poland: is the encoding correct? Is everything OK in the db? – Harald Jan 05 '14 at 08:12
  • Not sure how to answer that. I can get it to display correctly now by choosing 1250 server side and changing the browser encoding to Western. I can also get it to display properly on my WAS console. So is it really "incorrect" anywhere? If I hit the Poland database with my query tool (AQT), it displays as the r, but I have no control over the display encoding with that. I have the text file that is input to the db, and if I open that in UltraEdit, using 1252 encoding, the proper a is displayed. I feel like if I could just force the browser to use the proper display encoding, all would be OK. – Mike Jan 06 '14 at 13:42
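If the database or JDBC layer really is decoding CP1252 bytes as CP1250 (an assumption based on the symptoms in this thread, not something verified), one server-side workaround is to undo that step: encode the string back to its windows-1250 bytes and decode those bytes as windows-1252. A minimal sketch, with a hypothetical helper name `redecode`:

```java
import java.nio.charset.Charset;

public class Redecode {
    // Reverse a wrong windows-1250 decode by recovering the original bytes
    // and decoding them as windows-1252 instead.
    static String redecode(String misdecoded) {
        Charset cp1250 = Charset.forName("windows-1250");
        Charset cp1252 = Charset.forName("windows-1252");
        return new String(misdecoded.getBytes(cp1250), cp1252);
    }

    public static void main(String[] args) {
        String fromDb = "Universit\u0155";   // what the application is assumed to receive
        String fixed = redecode(fromDb);     // ends in U+00E0 (a with grave)
        System.out.printf("last char: U+%04X%n", (int) fixed.charAt(fixed.length() - 1));
    }
}
```

Once the string holds the right characters, any response charset that can represent them (UTF-8, for example) should display correctly without touching the browser settings. The round trip is lossy for bytes that one of the two code pages leaves undefined, so correcting the decode where the data is loaded into or read from the database would be the cleaner fix.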

1 Answer


First of all, each source must come with the character set it has been encoded with (i.e. you must know it), otherwise you won't know what character set to use when presenting that source, and your problem will arise with the next data source.
Secondly, if you can, you should ask your sources to move to utf-8, and have those providers re-write their content.

Having a common character set for all your sources is the best solution (and using utf-8 is the most compatible / standard-oriented way of doing it as of today). If you can't make them do the conversion, then, knowing the source encoding, you may try to convert the data from the source charset to your charset using a converter (I haven't used any, so I can't give you any advice on this).

Finally, two notes:
1) there's no way to show two contents that use different character sets in a single web application (nor in a single web page), since, as you already found, you may only use one encoding at a time;
2) if your data content is strictly web-oriented, you may ask your sources to use html entities (but keep in mind that this could be a problem if you later present that content in e.g. PDF form).

watery
  • Have a look at [this question](http://stackoverflow.com/questions/229015/encoding-conversion-in-java) for character set conversion. – watery Jan 03 '14 at 17:56
  • Thanks for the comments, but that does not solve the problem. Please see my comments above for more details. I am not trying to display two content types; I am trying to get the one Italian content type to display properly. Unfortunately, there is no way I can get the source to change/convert their data to UTF-8, much as I would like them to. – Mike Jan 03 '14 at 18:25
  • I just tried to cover all the possibilities I'm aware of (I faced a similar problem at work :-)). Note (1) was to remark that if you have more than one character set for the data you have to present, you will need to convert them to a common one. So, it looks like your only option is character set conversion in your application; have you tried that? – watery Jan 03 '14 at 18:35
  • As I re-read your comments on your own question, another source of trouble (this is a guess) could be the database that receives that data (if it isn't stored as raw bytes, of course), since that database column must have a character set associated with it, and if the incoming data isn't properly treated before being inserted in the database, any source encoding could be lost in favor of that database column encoding. – watery Jan 03 '14 at 18:42
  • Thanks. I have tried using the Charset API to do character conversion and could not get that to work. Could be that I wasn't using it properly though. This seems to be a case where you would have to remap the codepoint values and I don't really see how Charset actually does that. What I don't understand is why I can get the char to display properly by changing the encoding on the browser side, but I can't seem to do anything on the server to get the browser to do that by default. – Mike Jan 03 '14 at 18:43
  • I don't have that code anymore. I was working on this before the holidays and that was just one of the many things I tried and tossed. I will try to reproduce it and post it later or tomorrow. Who knows, maybe it will actually work this time. :-) – Mike Jan 03 '14 at 18:56
  • Here's my conversion code: `byte[] ba = rs.getBytes(1); Charset utf8 = Charset.forName("UTF-8"); Charset cp1252 = Charset.forName("Cp1252"); ByteBuffer inBb = ByteBuffer.wrap(ba); CharBuffer decodedIn = cp1252.decode(inBb); ByteBuffer encodedOut = utf8.encode(decodedIn); byte[] outputBa = encodedOut.array(); String s = new String(outputBa,utf8);` It results in the a with grave being displayed as U followed by a box char. No diff if I specify 1250 or 1252 in Java. System.out shows it in the console with a box char between every correct char. I also changed the page encoding to UTF-8. – Mike Jan 03 '14 at 20:10
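Two observations on the conversion code in the last comment, with a hedged sketch. First, decoding the bytes with the source charset already yields a Java `String`; encoding to UTF-8 and decoding again adds nothing. Second, `ByteBuffer.array()` returns the whole backing array, which can be longer than the encoded data, so it may append stray zero bytes. The class and method names below are made up for illustration, and the sketch assumes the column bytes really are single-byte CP1252, which the thread does not confirm.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeColumnBytes {
    // Decoding raw bytes with their source charset is the whole conversion.
    static String decode(byte[] raw, String sourceCharset) {
        return new String(raw, Charset.forName(sourceCharset));
    }

    // If UTF-8 bytes are needed, copy only the bytes that were actually written
    // instead of calling array() on the encoder's buffer.
    static byte[] toUtf8(String text) {
        ByteBuffer encoded = StandardCharsets.UTF_8.encode(text);
        byte[] exact = new byte[encoded.remaining()];
        encoded.get(exact);
        return exact;
    }

    public static void main(String[] args) {
        byte[] raw = {(byte) 0xE0};                 // stand-in for rs.getBytes(1)
        String text = decode(raw, "windows-1252");
        System.out.printf("U+%04X%n", (int) text.charAt(0)); // U+00E0
        System.out.println(toUtf8(text).length);             // 2
    }
}
```

A box between every correct character may mean `rs.getBytes(1)` returned two bytes per character (for example UTF-16 from an NVARCHAR column), in which case no single-byte charset would decode it correctly. That is a guess; printing the raw bytes in hex would settle it.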