36

Suppose I have:

<a href="http://www.yahoo.com/" target="_yahoo" 
    title="Yahoo!&#8482;" onclick="return gateway(this);">Yahoo!</a>
<script type="text/javascript">
function gateway(lnk) {
    window.open(SERVLET +
        '?external_link=' + encodeURIComponent(lnk.href) +
        '&external_target=' + encodeURIComponent(lnk.target) +
        '&external_title=' + encodeURIComponent(lnk.title));
    return false;
}
</script>

I have confirmed external_title gets encoded as Yahoo!%E2%84%A2 and passed to SERVLET. If in SERVLET I do:

Writer writer = response.getWriter();
writer.write(request.getParameter("external_title"));

I get Yahoo!â„¢ in the browser. If I manually switch the browser character encoding to UTF-8, it changes to Yahoo!™ (which is what I want).

So I figured the encoding I was sending to the browser was wrong (it was Content-type: text/html; charset=ISO-8859-1). I changed SERVLET to:

response.setContentType("text/html; charset=utf-8");
Writer writer = response.getWriter();
writer.write(request.getParameter("external_title"));

Now the browser character encoding is UTF-8, but it outputs Yahoo!⢠and I can't get the browser to render the correct character at all.

My question is: is there some combination of Content-type and/or new String(request.getParameter("external_title").getBytes(), "UTF-8"); and/or something else that will result in Yahoo!™ appearing in the SERVLET output?

Grant Wagner

8 Answers

46

You are nearly there. encodeURIComponent correctly encodes to UTF-8, which is what you should always use in a URL today.

The problem is that the submitted query string is getting mutilated on the way into your server-side script, because getParameter() uses ISO-8859-1 instead of UTF-8. This stems from Ancient Times before the web settled on UTF-8 for URI/IRI, but it's rather pathetic that the Servlet spec hasn't been updated to match reality, or at least to provide a reliable, supported option for it.

(There is request.setCharacterEncoding in Servlet 2.3, but it doesn't affect query string parsing, and if a single parameter has been read before, possibly by some other framework element, it won't work at all.)

So you need to futz around with container-specific methods to get proper UTF-8, often involving stuff in server.xml. This totally sucks for distributing web apps that should work anywhere. For Tomcat see https://cwiki.apache.org/confluence/display/TOMCAT/Character+Encoding and also the question "What's the difference between 'URIEncoding' of Tomcat, Encoding Filter and request.setCharacterEncoding".
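To make the mismatch concrete, here is a minimal Java sketch that percent-decodes the same value once as UTF-8 and once as ISO-8859-1 with the standard java.net.URLDecoder. The class name QueryDecodingDemo is made up for illustration, and the code only imitates what a container does when it picks a charset; it is not the container's actual implementation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class QueryDecodingDemo {
    // Percent-decodes a value using the given charset name.
    static String decode(String encoded, String charset) {
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "Yahoo!%E2%84%A2"; // what encodeURIComponent sends

        String asUtf8 = decode(encoded, "UTF-8");
        String asLatin1 = decode(encoded, "ISO-8859-1");

        // As UTF-8 the three bytes E2 84 A2 become the single character U+2122.
        System.out.println(asUtf8.length());   // 7
        // As ISO-8859-1 each byte becomes its own character.
        System.out.println(asLatin1.length()); // 9
    }
}
```

The ISO-8859-1 result is the same three-character garbage the question describes, which suggests the damage happens while the parameter is decoded rather than when the response is written.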

holmis83
bobince
  • 5
    Thanks for the explanation. At least I know I'm not crazy. I tried request.setCharacterEncoding() while looking for a solution and as you said, it didn't seem to do anything to help resolve my problem. – Grant Wagner Jan 22 '09 at 19:49
  • And here is a link for Jetty if anyone is using it (by default Jetty 6+ uses UTF-8 unless configured otherwise): http://docs.codehaus.org/display/JETTY/International+Characters+and+Character+Encodings – Riyad Kalla Jul 16 '11 at 22:08
  • 1
    `request.getParameter("name")` prints as `ÏηγÏÏÏÏη`. `request.getQueryString()` prints as `name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7` - which if passed to `URLDecoder.decode()` is decoded fine. Could you please comment on _why `getParameter()` does not return the percent-encoded string_? Isn't ISO-8859-1 a superset of ASCII? – Mr_and_Mrs_D Oct 06 '12 at 20:43
  • 2
    `getParameter` is intended to take care of decoding the input for you - browsers encode form values with percents when submitted so you have to decode them to get the user's input. There has to be some encoding used to turn the bytes in the input into characters, and browsers don't always use the same encoding. Unfortunately Servlet chooses one for you, it doesn't choose well, and it doesn't let you override that choice - unlike `URLDecoder.decode` there is no `enc` argument. – bobince Oct 07 '12 at 11:38
  • 2
    If you want the percent-encoded content from the raw URL, use `getQueryString()` and parse it yourself instead of letting Servlet do it. – bobince Oct 07 '12 at 11:39
21

I had the same problem and solved it by decoding request.getQueryString() with URLDecoder.decode() and then extracting my parameters.

String[] parameters = URLDecoder.decode(request.getQueryString(), "UTF-8")
                       .split("&");
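A variant that splits the raw query string before decoding, so that an encoded & (%26) or = (%3D) inside a name or value survives, could look like the sketch below. The class name QueryStringParser and the choice to keep only the last value of a repeated parameter are assumptions made for illustration; note also that URLDecoder turns + into a space, which matches form encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

class QueryStringParser {
    // Splits the raw query string first, then percent-decodes each name and value as UTF-8.
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue; // tolerate "a=1&&b=2"
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(name), decode(value)); // a repeated name keeps its last value
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }
}
```

In a servlet this would be called as QueryStringParser.parse(request.getQueryString()). It only covers the query string, not parameters sent in a POST body.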
Mr_and_Mrs_D
Modi
    Handling the query string yourself is a good idea to deal with the problems in `getParameter`, however this isn't quite right: it should URL-decode *after* splitting the components apart, rather than before. The code above would fail for any use of the `&` character in parameters (encoded to `%26`), or `=` in parameter names (`%3D`). – bobince Apr 04 '15 at 10:30
  • 1
    what about POST parameters ? – lmo Nov 04 '15 at 13:23
  • See http://stackoverflow.com/questions/4128436/query-string-manipulation-in-java/ for several manual decoding examples. – Vadzim Apr 23 '16 at 21:19
18

There is a way to do it in Java (no fiddling with server.xml).

Does not work:

protected static final String CHARSET_FOR_URL_ENCODING = "UTF-8";

String uname = request.getParameter("name");
System.out.println(uname);
// ÏηγÏÏÏÏη
uname = request.getQueryString();
System.out.println(uname);
// name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7
uname = URLDecoder.decode(request.getParameter("name"),
        CHARSET_FOR_URL_ENCODING);
System.out.println(uname);
// ÏηγÏÏÏÏη // !!!!!!!!!!!!!!!!!!!!!!!!!!!
uname = URLDecoder.decode(
        "name=%CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7",
        CHARSET_FOR_URL_ENCODING);
System.out.println("query string decoded : " + uname);
// query string decoded : name=τηγρτσςη
uname = URLDecoder.decode(new String(request.getParameter("name")
        .getBytes()), CHARSET_FOR_URL_ENCODING);
System.out.println(uname);
// ÏηγÏÏÏÏη // !!!!!!!!!!!!!!!!!!!!!!!!!!!

Works :

final String name = URLDecoder
        .decode(new String(request.getParameter("name").getBytes(
                "iso-8859-1")), CHARSET_FOR_URL_ENCODING);
System.out.println(name);
// τηγρτσςη

This worked, but it will break if the platform default encoding is not UTF-8. Try this instead (the call to decode() is not needed, so omit it):

final String name = new String(request.getParameter("name").getBytes("iso-8859-1"),
        CHARSET_FOR_URL_ENCODING);
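The reason this trick works can be checked outside a servlet. The sketch below (the class name Latin1RoundTrip is made up) simulates the container by decoding UTF-8 bytes as ISO-8859-1 and then undoes it. It assumes the value really was decoded as ISO-8859-1, which maps every byte to exactly one character and so loses nothing.

```java
import java.nio.charset.StandardCharsets;

class Latin1RoundTrip {
    // Re-interprets a string that was wrongly decoded as ISO-8859-1 as UTF-8.
    static String repair(String misdecoded) {
        byte[] originalBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The Greek test word from the answer, written with escapes to avoid
        // depending on the source file encoding.
        String original = "\u03C4\u03B7\u03B3\u03C1\u03C4\u03C3\u03C2\u03B7";

        // Simulate the container: UTF-8 bytes decoded as ISO-8859-1.
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String misdecoded = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(repair(misdecoded).equals(original)); // true
    }
}
```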

As I said above, if server.xml has been modified as in:

<Connector connectionTimeout="20000" port="8080" protocol="HTTP/1.1"
                     redirectPort="8443"  URIEncoding="UTF-8"/> 

(notice the URIEncoding="UTF-8") the code above will break (because getBytes("iso-8859-1") would have to read getBytes("UTF-8")). So for a bulletproof solution you have to get the value of the URIEncoding attribute. Unfortunately this seems to be container specific and, even worse, container version specific. For Tomcat 7 you'd need something like:

import javax.management.AttributeNotFoundException;
import javax.management.InstanceNotFoundException;
import javax.management.MBeanException;
import javax.management.MBeanServer;
import javax.management.MBeanServerFactory;
import javax.management.MalformedObjectNameException;
import javax.management.ObjectName;
import javax.management.ReflectionException;

import org.apache.catalina.Server;
import org.apache.catalina.Service;
import org.apache.catalina.connector.Connector;
import javax.servlet.http.HttpServlet;

public class Controller extends HttpServlet {

    // ...
    static String CHARSET_FOR_URI_ENCODING; // the `URIEncoding` attribute
    static {
        MBeanServer mBeanServer = MBeanServerFactory.findMBeanServer(null).get(
            0);
        ObjectName name = null;
        try {
            name = new ObjectName("Catalina", "type", "Server");
        } catch (MalformedObjectNameException e1) {
            e1.printStackTrace();
        }
        Server server = null;
        try {
            server = (Server) mBeanServer.getAttribute(name, "managedResource");
        } catch (AttributeNotFoundException | InstanceNotFoundException
                | MBeanException | ReflectionException e) {
            e.printStackTrace();
        }
        Service[] services = server.findServices();
        for (Service service : services) {
            for (Connector connector : service.findConnectors()) {
                System.out.println(connector);
                String uriEncoding = connector.getURIEncoding();
                System.out.println("URIEncoding : " + uriEncoding);
                boolean use = connector.getUseBodyEncodingForURI();
                // TODO : if(use && connector.get uri enc...)
                CHARSET_FOR_URI_ENCODING = uriEncoding;
                // ProtocolHandler protocolHandler = connector
                // .getProtocolHandler();
                // if (protocolHandler instanceof Http11Protocol
                // || protocolHandler instanceof Http11AprProtocol
                // || protocolHandler instanceof Http11NioProtocol) {
                // int serverPort = connector.getPort();
                // System.out.println("HTTP Port: " + connector.getPort());
                // }
            }
        }
    }
}

And you still need to tweak this for multiple connectors (check the commented-out parts). Then you would use something like:

new String(parameter.getBytes(CHARSET_FOR_URI_ENCODING), CHARSET_FOR_URL_ENCODING);

Still, this may fail (IIUC) if parameter = request.getParameter("name"), decoded with CHARSET_FOR_URI_ENCODING, was corrupted, so that the bytes returned by getBytes() are not the original ones (that's why "iso-8859-1" is used by default: it preserves the bytes). You can avoid all of this by manually parsing the query string, along the lines of:

URLDecoder.decode(request.getQueryString().split("=")[1],
        CHARSET_FOR_URL_ENCODING);
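The caveat about corrupted bytes can be illustrated with a small sketch. The class name LossyDecodeDemo and the single byte 0xE9 (an é sent by a client that encoded it as ISO-8859-1) are assumptions picked for the example. Decoding that byte as UTF-8 replaces it with U+FFFD, so encoding the string again does not give the original byte back, whereas ISO-8859-1 round-trips any byte.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class LossyDecodeDemo {
    // Decodes the bytes with the charset, encodes them again, and reports
    // whether the original bytes came back unchanged.
    static boolean survives(byte[] raw, Charset charset) {
        byte[] roundTripped = new String(raw, charset).getBytes(charset);
        return Arrays.equals(raw, roundTripped);
    }

    public static void main(String[] args) {
        byte[] latin1Byte = { (byte) 0xE9 }; // not a valid UTF-8 sequence on its own

        System.out.println(survives(latin1Byte, StandardCharsets.ISO_8859_1)); // true
        System.out.println(survives(latin1Byte, StandardCharsets.UTF_8));      // false
    }
}
```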

I am still looking for the place in the docs where it is stated that request.getParameter("name") calls URLDecoder.decode() instead of returning the string %CF%84%CE%B7%CE%B3%CF%81%CF%84%CF%83%CF%82%CE%B7. A link to the source would be much appreciated.
Also, how can I pass, say, the string %CE as the parameter's value? => See the comment: parameter=%25CE

Community
Mr_and_Mrs_D
  • 1
    If you want to pass `%CE` you simply encode it, so `parameter=%25CE` – Bart van Heukelom Oct 25 '12 at 11:09
  • yes, I prefer leaving platform configurations untouched as much as possible. I'll enter the ISO-Charset in my custom servlets configuration (custom properties in tomcat/conf), so I can change it at runtime or even adjust it in new server deployments - if needed. Specifications should always rule over customizations. – Gunnar May 17 '17 at 08:29
2

I suspect that the data mutilation happens in the request, i.e. the declared encoding of the request does not match the one that is actually used for the data.

What does request.getCharacterEncoding() return?

I don't really know how JavaScript handles encodings or how to make it use a specific one.

You need to make sure that encodings are used correctly at all stages - do NOT try to "fix" the data by using new String() and getBytes() at a point where it has already been encoded incorrectly.

Edit: It may help to have the origin page (the one with the Javascript) also encoded in UTF-8 and declared as such in its Content-Type. Then I believe Javascript may default to using UTF-8 for its request - but this is not definite knowledge, just guesswork.

Michael Borgwardt
  • request.getCharacterEncoding() is returning ISO-8859-1. So I think the problem is that encodeURIComponent() encodes the value as UTF-8, but it is getting mangled by the request encoding of ISO-8859-1. – Grant Wagner Jan 22 '09 at 17:31
0

There is a bug in certain versions of Jetty that makes it parse higher-numbered UTF-8 characters incorrectly. If your server accepts Arabic letters correctly but not emoji, that's a sign you have a version with this problem, since Arabic is not in ISO-8859-1 but is in the lower range of UTF-8 characters ("lower" meaning Java will represent it in a single char).

I updated from version 7.2.0.v20101020 to version 7.5.4.v20111024 and this fixed the problem; I can now use the getParameter(String) method instead of having to parse it myself.

If you're really curious, you can dig into your version of org.eclipse.jetty.util.Utf8StringBuilder.append(byte) and see whether it correctly adds multiple chars to the string when the utf-8 code is high enough or if, as in 7.2.0, it simply casts an int to a char and appends.
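The difference between the two behaviours can be sketched in plain Java. This is not Jetty's code; the class and method names are invented, and the sketch only contrasts casting a code point to char with appending it as a proper code point.

```java
class CodePointAppendDemo {
    // Buggy variant: narrowing to char keeps only the low 16 bits,
    // so code points above U+FFFF turn into a different character.
    static String appendByCast(int codePoint) {
        return new StringBuilder().append((char) codePoint).toString();
    }

    // Correct variant: appends a surrogate pair when the code point is above U+FFFF.
    static String appendCodePoint(int codePoint) {
        return new StringBuilder().appendCodePoint(codePoint).toString();
    }

    public static void main(String[] args) {
        int arabicAin = 0x0639; // fits in a single char
        int emoji = 0x1F600;    // needs two chars (a surrogate pair)

        System.out.println(appendByCast(arabicAin).equals(appendCodePoint(arabicAin))); // true
        System.out.println(appendCodePoint(emoji).length()); // 2
        System.out.println(appendByCast(emoji).length());    // 1
    }
}
```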

Ben B
0

You could always use javascript to manipulate the text further.

<div id="test">a</div>
<script>
var a = document.getElementById('test');
alert(a.innerHTML);
a.innerHTML = decodeURI("Yahoo!%E2%84%A2");
alert(a.innerHTML);
</script>
jacobangel
  • Yes, decodeURIComponent() returns the correct value, but only if I extract the value from the URL in JavaScript. If I attempt to decodeURIComponent('<%= request.getParameter("external_title") %>'); I don't get the correct value. – Grant Wagner Jan 22 '09 at 17:32
0

I think I can get the following to work:

encodeURIComponent(escape(lnk.title))

That gives me %25u2122 (for &#8482;) or %25AE (for &#174;), which will decode to %u2122 and %AE respectively in the servlet.

I should then be able to turn %u2122 into '\u2122' and %AE into '\u00AE' relatively easily using (char) (base-10 integer value of %uXXXX or %XX) in a match and replace loop using regular expressions.

i.e. - match /%u([0-9a-f]{4})/i, extract the matching subexpression, convert it to base-10, turn it into a char and append it to the output, then do the same with /%([0-9a-f]{2})/i
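A rough Java sketch of that match-and-replace loop follows. The class name EscapeDecoder is hypothetical, and it deviates slightly from the description above by handling both forms in a single pass with one pattern, so that output produced by one replacement is never decoded a second time.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapeDecoder {
    // Matches %uXXXX (four hex digits) or %XX (two hex digits),
    // the two forms produced by JavaScript's escape().
    private static final Pattern ESCAPED =
            Pattern.compile("%u([0-9a-fA-F]{4})|%([0-9a-fA-F]{2})");

    static String unescape(String input) {
        Matcher m = ESCAPED.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start()); // copy the text before the match
            String hex = m.group(1) != null ? m.group(1) : m.group(2);
            out.append((char) Integer.parseInt(hex, 16)); // hex digits to a char
            last = m.end();
        }
        out.append(input, last, input.length()); // copy the remaining tail
        return out.toString();
    }
}
```

For example, EscapeDecoder.unescape("Yahoo!%u2122") yields Yahoo! followed by U+2122, and "%AE" yields U+00AE.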

Grant Wagner
  • This is one possible encoding scheme you could use to get around the Servlet Parameter Charset Problem. (One that didn't use the dodgy JavaScript escape() function might be better.) But any such isn't the standard way to pass parameters in, so any other scripts/forms wouldn't be able to talk to it. – bobince Jan 22 '09 at 18:39
  • 1
    I agree that using escape() isn't the best option, but I'd rather not write my own encoding routine in JavaScript. I've tested my design using escape() in IE6, 7 & 8, Firefox 2 & 3, Opera 9.6, Safari for Windows 3.2.1 and Google Chrome and it works consistently for those browsers. – Grant Wagner Jan 22 '09 at 20:13
0

Thanks for everything I learned here about how Tomcat and Jetty handle the default character set when encoding and decoding. I solved my problem with the following method, using Google Guava:

String str = URLDecoder.decode(request.getQueryString(), StandardCharsets.UTF_8.name());
final Map<String, String> map = Splitter.on('&').trimResults().withKeyValueSeparator("=").split(str);
System.out.println(map);
System.out.println(map.get("aung"));
System.out.println(map.get("aa"));
Aung Aung