I have a strange problem with wrong URI encoding and would appreciate any help!

The project uses JSPs, Servlets, jQuery, and Tomcat 6.

The charset in the JSPs is set to UTF-8, all Tomcat connectors use URIEncoding=UTF-8, and I also use a character encoding filter as described here. I also set the contentType in the meta tag, and my browser detects it correctly.

In Ajax calls with jQuery I use encodeURIComponent() on the terms I want to use as URL parameters and then serialize the whole parameter set with $.param(). In the called servlet these parameters are decoded correctly with java.net.URLDecoder.decode(term, "UTF-8").

In some places I generate URLs for href attributes from a parameter map in the JSPs. Each parameter value is encoded with java.net.URLEncoder.encode(value, "UTF-8") on the JSP side, but decoding it the same way as before results in broken special characters. Instead, I have to encode it as "ISO-8859-2" in the JSP, which is then decoded correctly as "UTF-8" in the servlet.

An example to clarify: the term "überfall" is URI-encoded via JavaScript (%C3%BCberfall) and sent to the servlet for decoding and processing, which works. After passing it back to a JSP, I would encode it as UTF-8 and build the URL, which results for instance in:

<a href="/myWebapp/servletPath?term=%C3%BCberfall">Click here</a>

However, clicking this link sends the parameter as "%C3%83%C2%BCberfall" to the servlet, which decodes to "Ã¼berfall". The same occurs when no encoding takes place.

When using "ISO-8859-2" for encoding, I get:

<a href="/myWebapp/servletPath?term=%FCberfall">Click here</a>

When clicking this link, I can observe in Wireshark that %C3%BCberfall is sent as the parameter, which again decodes to "überfall"!

Can anyone tell me what I am missing?

EDIT: While observing the Network Tab in Firebug I realized that by using

$.param({term : encodeURIComponent(term)}); 

the term is UTF-8 encoded twice, resulting in "%25C3%25BCberfall", i.e. the percent symbols are also percent-encoded. Analogously, it works for me if I call encode(term, "UTF-8") twice on each value from the parameter map.
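The double encoding described in this edit can be reproduced on the Java side. The following is only a minimal sketch: the class and method names are made up for illustration, and it uses the `Charset` overloads of `URLEncoder`/`URLDecoder`, which need Java 10 or newer (the `encode(value, "UTF-8")` form used in the post behaves the same way).

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original project.
final class DoubleEncodingDemo {

    // Percent-encode a value as UTF-8, like encode(value, "UTF-8") in the post.
    static String encode(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Percent-decode a value as UTF-8.
    static String decode(String value) {
        return URLDecoder.decode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String term = "\u00fcberfall";   // \u00fc is the u-umlaut

        String once = encode(term);      // %C3%BCberfall
        String twice = encode(once);     // the '%' signs become %25

        System.out.println(once);
        System.out.println(twice);          // %25C3%25BCberfall
        System.out.println(decode(twice));  // a single decode only undoes one layer
    }
}
```

Encoding twice only appears to work because the extra layer consists of plain ASCII (`%25`), which survives a decoding step with the wrong charset unchanged.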

Encoding once and not decoding the String results in "Ã¼berfall" again.

KahPhi
  • [This is a thorough answer](http://stackoverflow.com/a/138950/95033) on setting up a Java webapp for UTF-8. I keep it around for reference. However, I think you got everything covered and do not have any idea yet how to solve your problem, sorry. – Wolfram Jul 16 '12 at 13:28
  • If you view the source of the HTML, what does the href look like then? – jontro Jul 16 '12 at 14:14
  • @Wolfram thanks, this is a nice summary. I think that I already implemented all the things listed there... – KahPhi Jul 16 '12 at 14:26
  • @jontro the html snippets in my post are from the page source as shown in Firebug. – KahPhi Jul 16 '12 at 14:27
  • @KahPhi you should not decode the result of request.getParameter(). This should already be done by the servlet filter, could this be the cause? – jontro Jul 16 '12 at 14:29
  • @jontro The charset filter I use does nothing but call setCharacterEncoding on requests and responses with UTF-8. No en/decoding is done there. It is basically similar to the one described in the link in Wolfram's first comment or the one delivered with Tomcat. But you are right: the decoding step should not be needed if everything is set up correctly. This is the first time I use JSPs and might also be the last time :) – KahPhi Jul 16 '12 at 16:14

2 Answers

What encoding is Java using internally? Did you start your application with

-Dfile.encoding=utf-8

Please clarify where the "parameter map in the JSPs" is defined. Does it come from some persistent data storage, or are the strings given in your code as literals?

Some thoughts on what is going on, which might help:

Ã¼ is what comes out when a UTF-8 encoded ü is read as ISO-8859-1, i.e. when each byte is decoded on its own. %C3%BC is the URI-encoded representation of the two UTF-8 bytes of ü. I think this is what's happening:

%C3%BC gets wrongly decoded to → Ã¼, which gets encoded to → %C3%83%C2%BC, which then gets decoded again to → Ã¼, so you end up with Ã¼berfall.
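The chain above can be checked with a few lines of Java. This is just a sketch of the suspected mechanism, not a trace of what Tomcat actually does internally; the class and method names are hypothetical, and the `Charset` overloads used here require Java 10 or newer.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class illustrating the suspected mis-decoding chain.
final class MojibakeDemo {

    // Decode percent-escapes, wrongly treating each byte as one ISO-8859-1 character.
    static String decodeLatin1(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    // Percent-encode a string as UTF-8.
    static String encodeUtf8(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String sent = "%C3%BCberfall";   // correct UTF-8 encoding of the term

        // Bytes 0xC3 0xBC become the two characters \u00c3 and \u00bc.
        String wrong = decodeLatin1(sent);
        System.out.println(wrong.equals("\u00c3\u00bcberfall"));   // true

        // Re-encoding the broken string yields the value seen on the wire.
        System.out.println(encodeUtf8(wrong));   // %C3%83%C2%BCberfall
    }
}
```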

So I guess you are using the wrong encoding when decoding a URI-encoded string. This might have something to do with the internal encoding used by Java/the JVM:

By default, the JRE 7 installer installs a European languages version if it recognizes that the host operating system only supports European languages.

Wolfram
  • I did not set this parameter explicitly but I checked the Tomcat process in bash and it is obviously correctly set when starting it from within Eclipse. – KahPhi Jul 16 '12 at 15:00
  • I will try to set this also in eclipse.ini and see if it makes a difference. As '%' is '%25' in both UTF-8 and ISO encoding, the double encoding I mentioned in my edit would fit your assumption that there is ISO encoding set somewhere. The parameter map is also built in this servlet and saved as session attribute which is read in the jsp. The webapp queries a Rest service and has no problems in using utf-8 parameters, i.e. sending and receiving "überfall" as term without special coding. The mistake seems to happen between the browser and tomcat. – KahPhi Jul 16 '12 at 15:27

I think I have definitely fixed the problem now.

Following Jontro's comment I encoded all URL parameter values once and removed the manual servlet-side decoding.
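For illustration, encoding each value exactly once when building a link from a parameter map could look roughly like this. It is a sketch under assumptions: the helper name `toQuery` and the class are invented here, the sample parameters are borrowed from the query string in this thread, and the `Charset` overload of `URLEncoder.encode` needs Java 10 or newer.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not the code from the original project.
final class QueryStringBuilder {

    // Build "k1=v1&k2=v2", percent-encoding every key and value exactly once as UTF-8.
    static String toQuery(Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            String key = URLEncoder.encode(entry.getKey(), StandardCharsets.UTF_8);
            String value = URLEncoder.encode(entry.getValue(), StandardCharsets.UTF_8);
            query.add(key + "=" + value);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();   // keeps insertion order
        params.put("searchType", "quick");
        params.put("term", "\u00fcberfall");   // \u00fc is the u-umlaut

        // prints /myWebapp/servletPath?searchType=quick&term=%C3%BCberfall
        System.out.println("/myWebapp/servletPath?" + toQuery(params));
    }
}
```

On the servlet side, request.getParameter("term") then returns the decoded value directly, so no additional URLDecoder call is needed. When the URL is written into an HTML attribute, the `&` separators should additionally be HTML-escaped.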

Sending an ü looks like %C3%BC in Firebug's Network tab, as it should, but it gave me Ã¼ in the servlet. Java was definitely set to "UTF-8" as its internal encoding via the -Dfile.encoding parameter. I traced the problem to the request.getParameter() method as shown here. request.getQueryString() was fine, but extracting the actual parameters failed:

request.getCharacterEncoding() => UTF-8
request.getContentType() => null
request.getQueryString() => from=0&resultCount=10&sortAsc=true&searchType=quick&term=%C3%BC
request.getParameter("term") => Ã¼
Charset.defaultCharset() => UTF-8
OutputStreamWriter.getEncoding() => UTF8
new String(request.getParameter("term").getBytes(), "UTF-8") => Ã¼
System.getProperty("file.encoding") => UTF-8

By looking into the sources of Tomcat and Coyote, which implement request.getParameter(), I found the problem: the URIEncoding from the connector was always null, and in this case it defaults to org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING, which is "ISO-8859-1", like Wolfram said.
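As a side note, this also explains why the new String(...getBytes(), "UTF-8") line in the trace changed nothing: with a UTF-8 default charset, getBytes() and the constructor simply undo each other. Taking the bytes back as ISO-8859-1 does recover the original text. The sketch below shows this; the class and method names are hypothetical, and this is meant as a diagnostic aid rather than a replacement for configuring the connector correctly.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic helper, not part of the original project.
final class Latin1Repair {

    // Undo a wrong ISO-8859-1 decoding of bytes that were really UTF-8.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    // What the trace effectively did when the default charset is UTF-8:
    // encode and decode with the same charset, which changes nothing.
    static String utf8RoundTrip(String value) {
        return new String(value.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The broken parameter value: \u00c3\u00bc instead of \u00fc (u-umlaut).
        String broken = "\u00c3\u00bcberfall";

        System.out.println(utf8RoundTrip(broken).equals(broken));      // true, still broken
        System.out.println(repair(broken).equals("\u00fcberfall"));    // true, repaired
    }
}
```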

Long story short: my mistake was editing the server.xml in Tomcat's conf directory, which is only loaded ONCE into Eclipse when a new server is created in the Servers view! After that, a separate server.xml in the Servers project has to be edited. After doing so, the connector setting is loaded correctly and everything works as it should.

Thanks for the comments! Hope this helps someone...

KahPhi
  • I had a feeling that Eclipse kept a copy somewhere; I had the exact same problem. You just ended 6 hours of misery :) Thank you – GCon Jan 26 '14 at 22:27