Encoding Issue: "São Paulo" becomes "S%C3%A3o%20Paulo" then "SÃ£o Paulo"

Question

I have a Spring application that is experiencing some encoding issues. When the client submits "São Paulo", I see it in the incoming request as:

=============>>> url is: /users/1825220/activity=update_fields&hometown=S%C3%A3o%20Paulo&usrId=1234 (PUT)

That is generated by dumping the request in the log as it comes in.

logger.info("\n=============>>> url is: " + request.getRequestURI() + "/" + request.getQueryString() + "  (" + request.getMethod() + ")");

The request is then passed to the method:

@RequestMapping(value = "/users/{id}", method = RequestMethod.PUT)
public @ResponseBody
OperationResponse updateUser(HttpServletRequest request,
        @PathVariable("id") Integer id,
        @RequestParam(value = "hometown", required = false) String homeTown) 
throws NoSuchAlgorithmException, UnsupportedEncodingException {

When I dump the value:

logger.debug("HOMETOWN=" + homeTown);

I get: HOMETOWN=SÃ£o Paulo

I am somewhat familiar with the basics of encoding and everything looks to be UTF-8, but evidently I do not know enough to figure this out. I have seen several topics on this, even with the same data, but I have not found anything that addresses it exactly that works.

I see that the values are correct. e.g.: The ã (in São) has these hex values. http://www.utf8-chartable.de/

U+00A3  £   c2 a3   POUND SIGN
U+00C3  Ã   c3 83   LATIN CAPITAL LETTER A WITH TILDE
U+00E3  ã   c3 a3   LATIN SMALL LETTER A WITH TILDE

The incoming values are the same from both a native iOS app and a website and via curl. For some reason, the ã (U+00E3) is being broken out into 4 bytes (%C3%A3) instead of 2 (%E3). I just can't figure out where the disconnect is.

What I would prefer is to find a configuration setting to change somewhere, rather than having to add code changes everywhere the data comes in.

score 1 · Answer 1 · edited May 23 '17 at 12:06

0xE3 (this is only 1 byte, by the way) is the value in most 8-bit encodings - notably iso8859 and cp1252 - for ã.

However, url encoding is often done in UTF-8 for better compatibility. Hence the 2 bytes, 0xC3 0xA3.

In your case, your server is reading this as if it were not 1 utf-8 character, but 2 iso (or cp) characters. Hence the result.
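To make the misreading concrete, here is a minimal sketch in plain Java (no Spring involved). The class name MisdecodeDemo and the helper decode are made up for illustration. It takes the two UTF-8 bytes for ã and interprets them once as UTF-8 and once as ISO-8859-1:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the asker's application.
public class MisdecodeDemo {
    /** Interprets the given bytes under the given charset. */
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // U+00E3 (a with tilde) is the two bytes 0xC3 0xA3 in UTF-8.
        byte[] utf8Bytes = "\u00E3".getBytes(StandardCharsets.UTF_8);

        // Read as UTF-8: one character, U+00E3.
        String right = decode(utf8Bytes, StandardCharsets.UTF_8);
        // Read as ISO-8859-1: two characters, U+00C3 and U+00A3.
        String wrong = decode(utf8Bytes, StandardCharsets.ISO_8859_1);

        System.out.println(right.length() + " char(s) when read as UTF-8");
        System.out.println(wrong.length() + " char(s) when read as ISO-8859-1");
    }
}
```

The two characters produced by the ISO-8859-1 reading, U+00C3 and U+00A3, are exactly the "Ã£" that shows up in the log.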

The solution suggested by AgilePro would work in most cases, however it would be cleaner to address the actual issue, by configuring your service to accept UTF-8, or to make sure that your client indicates the encoding they use.

This question may be related to this problem: Spring MVC UTF-8 Encoding

AgilePro · Answer 2 · 2014-10-10T22:03:25.810

The problem you are running into is the standard UTF-8 encoding problem, which commonly happens with URL parameters if they are not decoded in the right order.

For UTF-8, any character value greater than 127 is converted to a multi-byte sequence which is composed exclusively of byte values greater than 127. So your ã is properly being encoded into the two byte values. Then the byte values are converted to the %xx notation used by URL encoding.

To decode this, you need to do the opposite: convert the % notation into a stream of bytes, and then convert the bytes into a string using UTF-8 encoding. The problem is that some environments do this in the wrong order: they convert the byte stream to a string (decoding the UTF-8) and then they tackle the URL encoding. That is the wrong order.
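As a rough illustration of what the charset choice does during URL decoding, here is a sketch using the JDK's java.net.URLDecoder. The class UrlDecodeDemo and the helper decodeParam are hypothetical names, and this is not necessarily how your servlet container decodes the query string internally:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical demo class; it only mimics the decoding step outside of any container.
public class UrlDecodeDemo {
    /** Turns %xx escapes into bytes, then interprets those bytes in the named charset. */
    static String decodeParam(String raw, String charsetName) {
        try {
            return URLDecoder.decode(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Only reachable if the charset name is unknown to this JVM.
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "S%C3%A3o%20Paulo";

        String good = decodeParam(raw, "UTF-8");      // %C3%A3 becomes one character
        String bad = decodeParam(raw, "ISO-8859-1");  // %C3%A3 becomes two characters

        System.out.println(good.length()); // prints 9
        System.out.println(bad.length());  // prints 10
    }
}
```

The second call reproduces the corrupted value from the question, which suggests that something on the server side is decoding the query string with an 8-bit charset.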

There is a brute-force solution to get your value back: take the corrupted value, convert it back to bytes, and then convert those bytes to a string, like this:

String val = new String(oldval.getBytes("iso-8859-1"), "UTF-8");

This is rather unsightly code, but it will convert the characters back.
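If you do go this route, one way to keep it in a single place is a small helper. The sketch below is only an illustration: MojibakeRepair and repair are made-up names, and it uses StandardCharsets (Java 7+) so there is no checked UnsupportedEncodingException to deal with:

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for the workaround above.
public class MojibakeRepair {
    /**
     * Re-encodes the string as ISO-8859-1 to recover the original bytes,
     * then decodes those bytes as UTF-8.
     */
    static String repair(String corrupted) {
        byte[] originalBytes = corrupted.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "S" + U+00C3 + U+00A3 + "o Paulo" is the corrupted value from the question.
        String corrupted = "S\u00C3\u00A3o Paulo";
        String fixed = repair(corrupted);
        System.out.println(fixed.equals("S\u00E3o Paulo")); // prints true
    }
}
```

One caveat: only apply this to values that really were mis-decoded. Running it on a string that is already correct damages it, because a lone 0xE3 byte is not valid UTF-8 and gets replaced with U+FFFD.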

Setting the HTTPRequest object into UTF-8 mode can solve this problem. Do it like this:

request.setCharacterEncoding("UTF-8");

This might work for Spring ... I am not sure when the request parameters get parsed. In the case of Tomcat, if you are using a JSP file, by the time your JSP file is invoked it is too late to make this setting: the parameters will already have been parsed. The official best way to solve this is to insert a filter that makes this setting on the request object before the parameters are parsed and the JSP is invoked. If you find setting the character encoding does not work ... try a Filter.

I read elsewhere that you can enable such a filter in Spring with this setting in your web.xml (but I don't have experience with this):

<filter>  
    <filter-name>encodingFilter</filter-name>  
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>  
    <init-param>  
       <param-name>encoding</param-name>  
       <param-value>UTF-8</param-value>  
    </init-param>  
    <init-param>  
       <param-name>forceEncoding</param-name>  
       <param-value>true</param-value>  
    </init-param>  
</filter>  
<filter-mapping>  
    <filter-name>encodingFilter</filter-name>  
    <url-pattern>/*</url-pattern>  
</filter-mapping>

looks more like cp1252 (or was it macroman?) than like iso8859 to me — njzk2, Oct 10 '14 at 02:01
What is important here is that a character value between 128 and 255, gets converted to the same byte value between 128 and 255. cp1252 maps the byte values between 0x80 and 0x9F into unicode values that are not the same as the byte values. This would be a problem. iso-8859-1 is defined so that any byte value 0-255 maps to the same Unicode character 0-255. That is the behavior you need here. — AgilePro, Oct 10 '14 at 02:31
could be of importance if other chars are involved, but in this case after verification, these bytes have the same value in iso and cp — njzk2, Oct 10 '14 at 02:57
The first suggestion with "iso-8859-1" and "UTF-8" works, but as indicated it is unsightly and more or less a band-aid that does not address the primary cause. However, it does work, so thank you! setCharacterEncoding() had no effect. CharacterEncodingFilter was/is set in my app as indicated above, so that was not the cause. Just to reiterate, the issue occurs when coming from a mobile app and a website. — user231302, Oct 11 '14 at 03:41
Encoding in URL parameters is broken so often that I put an EXTRA hidden value with two Japanese characters in it, and I test that hidden value on every form post, just to be sure that encoding is correct. Crazy, I know, but this makes it work every time. If you like the answer, click on the little check mark under the answer vote count to mark it as an accepted answer. — AgilePro, Oct 11 '14 at 18:34
I went with your suggestion to use: String val = new String(oldval.getBytes("iso-8859-1"), "UTF-8"); The strings are relatively short and this works every time. Thanks! — user231302, Apr 15 '15 at 01:21
