39

When I read the XML through a URL's InputStream and then cut out everything except the URL, I get "http://cliveg.bu.edu/people/sganguly/player/%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3".

As you can see, there are a lot of "%20"s.

I want the url to be unescaped.

Is there any way to do this in Java, without using a third-party library?

Penchant
  • Just to be pedantic, there is no such thing as "normal unicode". UTF8 is one of several ways to represent unicode text. But there is no "true" canonical representation. – jalf Mar 08 '09 at 17:07
  • As Jon and ng said, this has nothing to do with Unicode or UTF-8. You might want to change the title. – Alan Moore Mar 09 '09 at 05:48
  • The answer now marked as correct is clearly wrong and should be removed. – freedev Jan 05 '21 at 22:03

4 Answers

66

This is not escaped XML; it is URL-encoded text. It looks like you want to use the following on the URL strings.

URLDecoder.decode(url);

This will give you the correct text. The result of decoding the link you provided is this.

http://cliveg.bu.edu/people/sganguly/player/ Rang De Basanti - Tu Bin Bataye.mp3

The %20 is an escaped space character. To get the above I used the static decode method of the URLDecoder class.
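As a rough, self-contained sketch of that call (the class name `DecodeUrlExample` and the helper `decode` are made up for illustration, and the two-argument overload with an explicit charset is used here rather than the one-argument form):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeUrlExample {
    // Hypothetical helper: decodes percent-escapes such as %20.
    // Note that URLDecoder also converts '+' into a space.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = "http://cliveg.bu.edu/people/sganguly/player/"
                + "%20Rang%20De%20Basanti%20-%20Tu%20Bin%20Bataye.mp3";
        System.out.println(decode(url));
    }
}
```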

ng.
18

Starting from Java 10 (so including Java 11 and later), use

URLDecoder.decode(url, StandardCharsets.UTF_8).

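For Java 7/8/9, use URLDecoder.decode(url, "UTF-8"), which declares the checked UnsupportedEncodingException.

URLDecoder.decode(String s) has been deprecated since Java 5, because its result depends on the platform's default encoding.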

Regarding the chosen encoding:

Note: The World Wide Web Consortium Recommendation states that UTF-8 should be used. Not doing so may introduce incompatibilities.
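To illustrate why the charset argument matters, here is a minimal sketch assuming Java 10 or later (for the `Charset` overload). The class name `CharsetMattersExample` and the sample string are made up for illustration; non-ASCII characters are written as `\u` escapes so the source compiles regardless of file encoding.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMattersExample {
    // Hypothetical helper: decodes with an explicit charset (Charset overload).
    static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String encoded = "caf%C3%A9"; // the UTF-8 bytes C3 A9 encode U+00E9 (e with acute)

        // Decoding with the charset that was used for encoding restores the text.
        String utf8 = decode(encoded, StandardCharsets.UTF_8);
        System.out.println(utf8.equals("caf\u00E9"));           // true

        // Decoding with a different charset turns each byte into its own character.
        String latin1 = decode(encoded, StandardCharsets.ISO_8859_1);
        System.out.println(latin1.equals("caf\u00C3\u00A9"));   // true
    }
}
```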

freedev
  • For Java 8 & 9, use `URLDecoder.decode(s, "UTF-8");` – Black Oct 25 '21 at 12:13
  • Since Java 7 [StandardCharsets](https://github.com/openjdk-mirror/jdk7u-jdk/blob/f4d80957e89a19a29bb9f9807d2a28351ed7f7df/src/share/classes/java/nio/charset/StandardCharsets.java#L52). Am I wrong? – freedev Oct 25 '21 at 12:32
  • Yes, but the URLDecoder method `decode` only takes (String, String) in Java 8 – Black Oct 25 '21 at 12:45
  • @user16320675 I'd considered that - but will it work with the underscore rather than hyphen in "UTF-8" ? – Black Oct 25 '21 at 22:19
  • Thank you for sharing. I never knew about this handy method before I read this answer! One cool feature of `URLDecoder.decode()` vs `new URI().getPath()`: the `URI` constructor will reject decoded URLs that contain spaces, while `URLDecoder.decode()` will accept both encoded and decoded URLs, e.g., (decoded) `/path/to/here and there` and (encoded) `/path/to/here%20and%20there`. – kevinarpe Apr 29 '22 at 12:30
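The difference described in the last comment can be sketched as follows, assuming Java 10 or later. The class and method names are hypothetical. One caveat worth knowing: `URLDecoder` follows form-encoding rules, so it also turns `+` into a space, which `URI.getPath()` does not.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

public class UriVersusUrlDecoder {
    // Hypothetical helper: returns the decoded path, or null if the URI constructor rejects the input.
    static String pathViaUri(String s) {
        try {
            return new URI(s).getPath();
        } catch (URISyntaxException e) {
            return null;
        }
    }

    // Hypothetical helper: decodes with URLDecoder and UTF-8.
    static String viaUrlDecoder(String s) {
        return URLDecoder.decode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String encoded = "/path/to/here%20and%20there";
        String decoded = "/path/to/here and there";

        System.out.println(pathViaUri(encoded));    // /path/to/here and there
        System.out.println(pathViaUri(decoded));    // null (a raw space is not a legal URI character)
        System.out.println(viaUrlDecoder(encoded)); // /path/to/here and there
        System.out.println(viaUrlDecoder(decoded)); // /path/to/here and there

        // The two approaches disagree on '+'.
        System.out.println(pathViaUri("/a+b"));     // /a+b
        System.out.println(viaUrlDecoder("/a+b"));  // /a b
    }
}
```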
0

I'm having problems using this method when I have special characters like á, é, í, etc. My (probably wild) guess is widechars are not being encoded properly... well, at least I was expecting to see sequences like %uC2BF instead of %C2%BF.

Edited: My bad, this post explains the difference between URL encoding and JavaScript's escape sequences: URI encoding in UNICODE for apache httpclient 4
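To make the distinction concrete: URL encoding percent-escapes each byte of the character's encoded form, not its code point. A small sketch, assuming Java 10 or later and a made-up class name `PercentEncodingBytes`, shows why `¿` (U+00BF) comes out as `%C2%BF` under UTF-8:

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class PercentEncodingBytes {
    // Hypothetical helpers wrapping the JDK encoder and decoder with UTF-8.
    static String encode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    static String decode(String text) {
        return URLDecoder.decode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // U+00BF (inverted question mark) is the two bytes C2 BF in UTF-8,
        // so each byte gets its own %XX escape.
        System.out.println(encode("\u00BF"));                  // %C2%BF

        // Decoding with the same charset round-trips the character.
        System.out.println(decode("%C2%BF").equals("\u00BF")); // true
    }
}
```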

Community
Mario
0

In my case the URL contained escaped HTML entities, so StringEscapeUtils.unescapeHtml4() from Apache Commons did the trick.