URL decoding Japanese characters etc. in Java

Question

I have a servlet that receives some POST data. Because this data is x-www-form-urlencoded, a string such as サボテン would be encoded to &#12469;&#12508;&#12486;&#12531;.

How would I unencode this string back to the correct characters? I have tried using URLDecoder.decode("encoded string", "UTF-8"); but it doesn't make a difference.

The reason I would like to decode them is that, before I display this data on a web page, I escape & to &amp;, and at the moment this also escapes the &s in the encoded string, so the characters do not show up properly.

The answer by BalusC is correct that this is XML entity encoding, not URL encoding, but is the response actually XML? If it is, you should just use an XML parser; if not, the service seems broken, as one should return XML as XML, not just text fragments from within a document. – StaxMan, Jan 11 '11 at 00:07
There isn't any XML, is there? The characters are received as HTML entities and sent back as HTML. – DanielGibbs, Jan 11 '11 at 00:11

BalusC · Accepted Answer · 2011-01-10T23:16:50.837

5

Those are not URL encodings. A URL-encoded form of that string would have looked like %E3%82%B5%E3%83%9C%E3%83%86%E3%83%B3. Those are decimal HTML/XML entities (numeric character references). To unescape HTML/XML entities, use Apache Commons Lang StringEscapeUtils.
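As a rough illustration of the difference, the sketch below runs both kinds of input through `URLDecoder` and through a small hand-written decoder for decimal references. The class `EntityDemo` and its helper methods are hypothetical names made up for this example and are not part of any library; it only handles decimal references, whereas `StringEscapeUtils` also covers named and hexadecimal entities.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDemo {
    // Matches decimal references only, e.g. &#12469; (not &amp; or &#x30B5;).
    private static final Pattern DECIMAL_ENTITY = Pattern.compile("&#(\\d{1,7});");

    /** Hypothetical helper: replaces each decimal reference with its character. */
    static String unescapeDecimalEntities(String input) {
        Matcher m = DECIMAL_ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            // Leave references to invalid code points untouched.
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Thin wrapper around URLDecoder so callers need no checked exception. */
    static String urlDecode(String input) {
        try {
            return URLDecoder.decode(input, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String entities = "&#12469;&#12508;&#12486;&#12531;";
        String percent = "%E3%82%B5%E3%83%9C%E3%83%86%E3%83%B3";
        System.out.println(urlDecode(percent));                // percent escapes are decoded
        System.out.println(urlDecode(entities));               // entities pass through unchanged
        System.out.println(unescapeDecimalEntities(entities)); // entities become characters
    }
}
```

This is why `URLDecoder.decode(..., "UTF-8")` appeared to make no difference in the question: the input contains no `%` escapes or `+` signs for it to act on.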

Update as per the comments: you will get question marks when the response encoding is not UTF-8. If you're using JSP, just add the following line to top of the page:

<%@ page pageEncoding="UTF-8" %>

For more detail, see the solutions about halfway through this article. I would prefer using UTF-8 all the way over fiddling with regexps, since regexps don't prepare you for world domination.

edited Jan 10 '11 at 23:16

answered Jan 10 '11 at 22:51

BalusC


Right, I tried StringEscapeUtils and it turns the Japanese characters into ?s. So I think that, rather than unescaping and escaping them again, I will use a regular expression that skips the leading & of an entity when replacing & with &amp;. – DanielGibbs Jan 10 '11 at 23:09
Ok, I'm using servlets, not JSP, but I will have a look at the article. Good encoding is definitely preferable to being unequipped for when the machines take over. – DanielGibbs Jan 10 '11 at 23:24
@DanieL - It isn't StringEscapeUtils that is turning the kana into `'?'`. That happens when the kana (text stored in Java's 16-bit `char` type) is "encoded" into a stream of bytes. If the character encoding in use (defaults to ISO-8859-1, which is designed for western European languages) doesn't recognize a character, it will substitute a `'?'`. You need to use a more internationalization-friendly encoding like UTF-8. If this app caters to Japanese, you can use something like Shift-JIS, which is more space efficient (but can't handle other characters like Cyrillic or Thai). – erickson Jan 10 '11 at 23:26
@DanieL: in a servlet, do `response.setCharacterEncoding("UTF-8")` and accordingly `response.setContentType("text/html;charset=UTF-8");`. Note that printing HTML in a servlet instead of delegating the job to the JSP is not the best practice... – BalusC Jan 10 '11 at 23:30
I think I may have broken things a bit more, but you're saying I should receive the String with the entities etc. unescape it into a UTF-8 String and use that for all future purposes (output, database etc.)? – DanielGibbs Jan 10 '11 at 23:46
Since I have changed the encoding of the page to UTF-8, I have received the string as ãµããã³. Would I be correct in saying that this is because Tomcat is not handling the string properly? – DanielGibbs Jan 11 '11 at 00:26
This can happen when you encode the string as bytes using the incorrect encoding because you're "forced" to write them as bytes, as in `response.getOutputStream().write(string.getBytes())`. This uses the platform default encoding to convert chars to bytes, which is not necessarily UTF-8. You shouldn't be using an output stream to write character data, but a writer: `response.getWriter().write(string)`. Or, even better, use JSP for presentation. Note that I assume that you *did* set the response encoding as suggested in the answer and comment. Tomcat has nothing to do with any of this. – BalusC Jan 11 '11 at 00:42
Yes, I set the page encoding. But now when I print out the raw string received from the POST it displays as ãµããã³. – DanielGibbs Jan 11 '11 at 01:15
Turns out I needed to do **request**.setCharacterEncoding("UTF-8"). Hopefully I can get it all working from here. Thanks for all your help! – DanielGibbs Jan 11 '11 at 01:30
Yes, when the page itself is instructed to use UTF-8, then the browser will just send the form data unescaped back as UTF-8, which you have to process as UTF-8 then (see also the linked article). After all you don't need the `StringEscapeUtils` then. – BalusC Jan 11 '11 at 02:16
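The two symptoms in this comment thread (kana turning into `?`, and kana turning into text like ãµããã³) can be reproduced outside a servlet container with plain `String` and `Charset` calls. The following is only a sketch of the underlying mechanism, assuming ISO-8859-1 is the encoding applied by default as described in the comments; the class `CharsetDemo` and its method names are hypothetical and exist only for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetDemo {
    /** Encodes text to bytes with the given charset and decodes it back. */
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    /** Simulates reading UTF-8 form data as if it were ISO-8859-1. */
    static String misdecodeUtf8AsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    /** Undoes misdecodeUtf8AsLatin1; possible because ISO-8859-1 maps every byte. */
    static String repairLatin1AsUtf8(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String kana = "\u30B5\u30DC\u30C6\u30F3"; // サボテン
        // Output side: a charset without kana substitutes '?' for each character.
        System.out.println(roundTrip(kana, StandardCharsets.ISO_8859_1));
        // Output side: UTF-8 preserves the text.
        System.out.println(roundTrip(kana, StandardCharsets.UTF_8));
        // Input side: UTF-8 bytes read with the wrong charset become mojibake.
        String garbled = misdecodeUtf8AsLatin1(kana);
        System.out.println(garbled);
        System.out.println(repairLatin1AsUtf8(garbled));
    }
}
```

In the servlet itself no repair step should be needed: as the thread concludes, calling `request.setCharacterEncoding("UTF-8")` before reading any parameters, and setting the response encoding before obtaining the writer, avoids both problems.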

score 1 · Answer 2 · answered Jan 11 '11 at 00:03

This is a feature/bug of browsers. If a web page is in a limited charset, say ASCII, and users type characters outside that charset into a form field, browsers will send these characters in the form &#xxxx;

This can be a problem because if users actually type &#xxxx; it will be sent as is, so the server has no way to distinguish the two cases.

The best way is to use a charset that covers all characters, like UTF-8, so browsers won't do this trick.

score 0 · Answer 3 · edited May 23 '17 at 12:04


Just a wild guess, but are you using Tomcat?

If so, make sure you have set up the Connector in Tomcat with a URIEncoding of UTF-8. Google that on the web and you will find a ton of hits such as

How to get UTF-8 working in Java webapps?

answered Jan 10 '11 at 22:51

rfeak


Yes I am, I've set the URIEncoding to UTF-8, but it hasn't made a difference, most likely because, as @BalusC pointed out, I am talking about HTML entities, not URL encoding. – DanielGibbs Jan 10 '11 at 23:10
Tomcat's URIEncoding applies to the URI in the first line of the HTTP request, *not* to the body of a POST request. – erickson Jan 10 '11 at 23:18
Exactly. It applies on GET only, not on POST. – BalusC Jan 10 '11 at 23:19

Byron Whitlock · Answer 4 · 2011-01-10T23:19:23.673

0

How about a regular expression?

Pattern pattern = Pattern.compile("&([^a][^m][^p][^;])?");
Matcher matcher = pattern.matcher(inputStr);
String output = matcher.replaceAll("&amp;$1");

edited Jan 10 '11 at 23:19

answered Jan 10 '11 at 22:51

Byron Whitlock


I think I will have to use a regular expression to replace & with &amp; and ignore the HTML entities, but yours turns &#12469;&#12508;&#12486;&#12531; into &69;&08;&86;&31; – DanielGibbs Jan 10 '11 at 23:15
Alright, I've got "&(?!#\\d+;)" working. Are there any modifications that I should make to it? – DanielGibbs Jan 10 '11 at 23:19
I've just modified my example to replace the matched expression. It should work now. – Byron Whitlock Jan 10 '11 at 23:20
It works, but it appears to do the same thing as replaceAll("&", "&amp;"). What I am trying to do is replace every & with &amp; except those in HTML entities like &#12469; – DanielGibbs Jan 10 '11 at 23:27
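For reference, the negative-lookahead pattern quoted in the comments above can be wrapped up as follows. This is a minimal sketch rather than a vetted escaper: the class and method names are hypothetical, and it deliberately recognises only decimal references, so named entities such as `&amp;` and hexadecimal references such as `&#x30B5;` are treated as bare ampersands and escaped again.

```java
import java.util.regex.Pattern;

class AmpersandEscaper {
    // An ampersand that does NOT start a decimal reference like &#12469;
    private static final Pattern BARE_AMPERSAND = Pattern.compile("&(?!#\\d+;)");

    /** Hypothetical helper: escapes bare ampersands, keeps decimal references. */
    static String escapeBareAmpersands(String input) {
        return BARE_AMPERSAND.matcher(input).replaceAll("&amp;");
    }

    public static void main(String[] args) {
        System.out.println(escapeBareAmpersands("Tom & Jerry &#12469;&#12508;"));
    }
}
```

As Answer 2 points out, this approach cannot tell a reference generated by the browser apart from one the user typed literally, which is one more reason the thread settles on UTF-8 end to end instead.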
