My web application (Java/Tomcat/Spring/Maven) is having trouble dealing with special characters like ’ (hex 92, decimal 146). The character arrives in my app as a different, garbled sequence.

I have looked at this question and verified that I have the following line in all my JSP files:

<%@ page contentType="text/html; charset=UTF-8" %>

I also looked at this question and verified that I have the following line in my Maven pom.xml:

<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>

So as far as I can tell everything should be built and handled in UTF-8. But when I submit the string Martin’s Auto Repair what shows up at the server during the Spring binding process is Martinâ\u0080\u0099s Auto Repair. This is the string that gets handed back by Tomcat to my application.

Worse, this is echoed back to the browser so submitting the altered string again expands the weird characters over and over.

Any suggestions? At this point I'm not sure if this is a browser problem or a server problem.

user3120173
  • Do you handle _requests_ as UTF-8 as well? And how exactly do you submit the string? As a query parameter? – fge Feb 26 '14 at 18:22

1 Answer

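Hex 92 is not a printable character in Unicode: U+0092 is a C1 control code (http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF).

Windows codepage 1252 is not 100% identical to Unicode. In codepage 1252, byte 0x92 is the right single quotation mark (’), which Unicode assigns to U+2019.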
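
The difference can be checked directly in Java. The following is only a small sketch: the class name `Cp1252Demo` and the helper `codePointOf` are made up for illustration, and it assumes the JDK in use ships the `windows-1252` charset (standard JDK distributions do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Demo {
    // Hypothetical helper: decode one byte with the given charset
    // and return the resulting Unicode code point.
    static int codePointOf(int b, Charset cs) {
        return new String(new byte[] { (byte) b }, cs).codePointAt(0);
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        // In codepage 1252, byte 0x92 is the right single quotation mark.
        System.out.printf("windows-1252: U+%04X%n", codePointOf(0x92, cp1252));
        // In ISO-8859-1, the same byte maps straight to the control code U+0092.
        System.out.printf("ISO-8859-1:   U+%04X%n",
                codePointOf(0x92, StandardCharsets.ISO_8859_1));
    }
}
```

So "hex 92" only means the apostrophe when the bytes are read as codepage 1252. In Unicode text the same character is U+2019, which UTF-8 encodes as the three bytes E2 80 99.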

Thorbjørn Ravn Andersen
  • Additionally, it looks like you _somewhere_ parse a UTF-8 encoded byte stream as ISO-Latin-1, or you would not see that interesting sequence in the output. – Thorbjørn Ravn Andersen Feb 26 '14 at 18:37
  • Thank you for your response. Assuming your analysis is correct, what is the solution for this problem? Is it a Tomcat configuration issue? A browser issue? – user3120173 Feb 26 '14 at 20:19
  • And is this a *general* problem with my setup, or did I happen to pick one particular example (hex 92) that does not work? – user3120173 Feb 26 '14 at 20:22
  • This definitely appears to be the problem, so I'm going to close this question and open another one: "How do I detect invalid UTF-8 strings?" – user3120173 Feb 26 '14 at 22:00
  • Default file encodings differ between platforms. I would suggest that you use only ASCII as the source encoding, and use \uXXXX escape notation for characters outside that range. This will ensure that your sources are platform independent. – Thorbjørn Ravn Andersen Feb 26 '14 at 23:18
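
The decoding mix-up described in the comments can be reproduced without Tomcat or Spring. This is a minimal sketch, not the application's actual code path: the class `MojibakeDemo` and its methods are hypothetical names, and it assumes the bug is exactly "UTF-8 bytes decoded as ISO-8859-1". The source is kept pure ASCII by writing the apostrophe as a Unicode escape, as the last comment suggests.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Simulate the suspected bug: the browser sends UTF-8 bytes,
    // but the server decodes them as ISO-8859-1.
    static String misdecode(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);
    }

    // Reverse one round of the bug: recover the raw bytes, decode as UTF-8.
    static String repair(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1),
                StandardCharsets.UTF_8);
    }

    // Print non-ASCII characters as escapes so the result is readable
    // in any console encoding.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x80) sb.append(c);
            else sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII-only source: the apostrophe is written as an escape.
        String original = "Martin\u2019s Auto Repair";
        String once = misdecode(original);
        String twice = misdecode(once); // echoed back and submitted again
        System.out.println(escapeNonAscii(once));
        System.out.println(original.length() + " -> " + once.length()
                + " -> " + twice.length());
        System.out.println(repair(once).equals(original));
    }
}
```

Under that assumption, one round turns the single apostrophe into the three characters reported in the question (â, U+0080, U+0099), and each further round trip makes the string longer, which matches the "expands over and over" symptom. That points at the server decoding the request with the wrong charset rather than at hex 92 being a special case.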