391

My Java standalone application gets a URL (which points to a file) from the user and I need to hit it and download it. The problem I am facing is that I am not able to encode the HTTP URL address properly...

Example:

URL:  http://search.barnesandnoble.com/booksearch/first book.pdf

java.net.URLEncoder.encode(url.toString(), "ISO-8859-1");

returns me:

http%3A%2F%2Fsearch.barnesandnoble.com%2Fbooksearch%2Ffirst+book.pdf

But, what I want is

http://search.barnesandnoble.com/booksearch/first%20book.pdf

(space replaced by %20)

I guess URLEncoder is not designed to encode HTTP URLs... The JavaDoc says "Utility class for HTML form encoding"... Is there any other way to do this?

Eddie
suDocker
  • 2
    See also http://stackoverflow.com/questions/10786042/java-url-encoding-of-query-string-parameters – Raedwald Jan 21 '16 at 15:58
  • The behaviour is entirely correct. URL encode is to turn something into a string that can be safely passed as a URL parameter, and isn't interpreted as a URL at all. Whereas you want it to just convert one small part of the URL. – Stephen Holt Jun 29 '17 at 08:37
  • 1
    Nitpicking: a string containing a whitespace character by definition is not a URI. So what you're looking for is code that implements the URI escaping defined in [Section 2.1 of RFC 3986](http://greenbytes.de/tech/webdav/rfc3986.html#rfc.section.2.1). – Julian Reschke Apr 07 '09 at 07:15

24 Answers

321

The java.net.URI class can help; in the documentation of URL you find

Note, the URI class does perform escaping of its component fields in certain circumstances. The recommended way to manage the encoding and decoding of URLs is to use an URI

Use one of the constructors with more than one argument, like:

URI uri = new URI(
    "http", 
    "search.barnesandnoble.com", 
    "/booksearch/first book.pdf",
    null);
URL url = uri.toURL();
//or String request = uri.toString();

(the single-argument constructor of URI does NOT escape illegal characters)


Only illegal characters get escaped by the above code; it does NOT escape non-ASCII characters (see fatih's comment).
The toASCIIString method can be used to get a String containing only US-ASCII characters:

URI uri = new URI(
    "http", 
    "search.barnesandnoble.com", 
    "/booksearch/é",
    null);
String request = uri.toASCIIString();

For a URL with a query like http://www.google.com/ig/api?weather=São Paulo, use the 5-parameter version of the constructor:

URI uri = new URI(
        "http", 
        "www.google.com", 
        "/ig/api",
        "weather=São Paulo",
        null);
String request = uri.toASCIIString();
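
For reference, the three constructor calls above can be collected into one small runnable class. This is only a sketch: the class name UriConstructorDemo and the helper method names are made up for illustration, checked exceptions are wrapped to keep the calls short, and the non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class UriConstructorDemo {

    // 4-argument constructor (scheme, host, path, fragment): escapes illegal
    // characters such as spaces, but leaves non-ASCII characters alone.
    static String escapePath(String host, String path) {
        try {
            return new URI("http", host, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Same constructor, but toASCIIString() additionally percent-encodes
    // non-ASCII characters as UTF-8 octets.
    static String escapePathAscii(String host, String path) {
        try {
            return new URI("http", host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // 5-argument constructor (scheme, authority, path, query, fragment).
    static String escapeWithQuery(String authority, String path, String query) {
        try {
            return new URI("http", authority, path, query, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(escapePath("search.barnesandnoble.com", "/booksearch/first book.pdf"));
        System.out.println(escapePathAscii("search.barnesandnoble.com", "/booksearch/\u00e9"));
        System.out.println(escapeWithQuery("www.google.com", "/ig/api", "weather=S\u00e3o Paulo"));
    }
}
```

The first call should print the URL asked for in the question, http://search.barnesandnoble.com/booksearch/first%20book.pdf.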
River
user85421
  • 17
    Please note, the URI class mentioned here is from "org.apache.commons.httpclient.URI", not "java.net". The "java.net" URI doesn't accept illegal characters unless you use the constructors that build the URL from its components, like the way mentioned in Matt's comment below – Mohamed Faramawi Jun 02 '10 at 20:47
  • 9
    @Mohamed: the class I mentioned and used for testing **actually is** `java.net.URI`: it worked perfectly (Java 1.6). I would mention the fully qualified class name if it was not the standard Java one and the link points to the documentation of `java.net.URI`. And, by the comment of Sudhakar, it solved the problem without including any "commons libraries"! – user85421 Jun 02 '10 at 21:07
  • 2
    URI uri = new URI("http", "search.barnesandnoble.com", "/booksearch/é",null); Does not do correct escaping with this sample? This should have been escaped with % escapes – fmucar Jan 19 '11 at 12:36
  • @fatih - that's correct, thanks! Normally that should not be a problem, but there is a simple solution - almost same as I wrote before. See 2nd edit. – user85421 Jan 19 '11 at 13:31
  • @Carlos Thx for the edit. Now it does escape but not correct escaping. It should be adding a % to the HEX value of char for Path params meaning é char should be converted to %e9 – fmucar Jan 19 '11 at 13:37
  • @fatih - not correct? Why not? The code uses just standard Java classes... It's returning `%C3%A9` for me, which is working perfectly (firefox and google). The URI standard (RFC2396) does not specify any particular character set to be used for encoding. The URI class uses UTF-8. Just because there are other options, does not mean the one is wrong. – user85421 Jan 19 '11 at 15:32
  • Sorry, I should have placed some link/more info in my previous reply. http://tools.ietf.org/html/rfc2396 The RFC2396 actually defines how to escape : " Data must be escaped if it does not have a representation using an unreserved character; this includes data that does not correspond to a printable character of the US-ASCII coded character set, or that corresponds to any US-ASCII character that is disallowed, as explained below."(2.4. Escape Sequences) So if the char is not in US-ASCII, then it needs to be escaped. – fmucar Jan 19 '11 at 16:09
  • (2.4.1) Escape encoding is done by prepending a % char to the 2 digit hex value of that character. '%C3%A9' so this is encoded in UTF8 but URI escaping needs to be done like the way defined in 2.4.1 and it needs to be exactly 3 digits. However, one bit which is confused is; RFC2396 does not say anything about the character set so if you are to include € in the URI, it needs to be converted to ('\u20AC') and then escape the resulting string as "%E2%82%AC". http://download.oracle.com/javase/6/docs/api/java/net/URI.html search for rfc2396, example is from javadocs of URI class itself. – fmucar Jan 19 '11 at 16:09
  • http://www.w3schools.com/TAGS/ref_urlencode.asp see the list of escaped chars values. Again, the table shows everything correctly but if you use the "URL Encode" functionality/button on the same page, you will see it is actually not escaping properly but returning utf-8 value for "é". :) – fmucar Jan 19 '11 at 16:12
  • @fatih - it is done like defined - (2.4.1): “An escaped *octet* is encoded as a character triplet, consisting of the percent character "%" followed by the two hexadecimal digits representing the *octet* code. …” An octet is not the same as a char… see 2.1 “URI and non-ASCII characters” from the same RFC2396. Surely not so simple... – user85421 Jan 19 '11 at 18:48
  • @Carlos, You are right, surely it is not char thats my poor wording just to explain. – fmucar Jan 19 '11 at 19:25
  • this one breaks at the ? . Any other solutions?`http://www.google.com/ig/api?weather=São Paulo` – Paulo Casaretto Jun 14 '11 at 22:19
  • @Paulo Casaretto - check the `EDIT 3` that I added to my answer above! – user85421 Jun 18 '11 at 18:41
  • I tried this proposed solution, but it failed because it also escaped ampersand (&) character. – Jeff Axelrod May 01 '12 at 16:52
  • So, it seems that unicode characters (e.g. ã) are not being encoded in any way (except by toASCIIString()) nor are spaces being converted to '+'. The code from edit 3 returns "weather=São%20Paulo" as the query string. What steps should I take to escape the arguments to new URI()? – Edward Falk Dec 07 '12 at 19:46
  • @EdwardFalk This is "correct" behaviour: it appears that Java tried to add support for non-ASCII URIs before they were standardized as IRIs by [RFC 3987](http://tools.ietf.org/html/rfc3987); unhelpfully the spec for `java.net.URI` permits many unwise characters (e.g. Unicode control characters like directional formatting). Additionally, representing spaces with `+` is specific to HTML's [application/x-www-form-urlencoded](http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1); to do that, use the (unhelpfully-named) `java.net.URLEncoder` class. – tc. Mar 26 '13 at 18:32
  • Anyone found a solution? I have a query with multiple variables and €-signs, how to deal with that? – Diego Aug 22 '13 at 01:17
  • It does help. You can use `java.net.URL` to decompose the bad URL, and `java.net.URI` to put it back together correctly. For http(s) URLs anyway. (Ugh!) – Donal Fellows Nov 27 '13 at 15:59
  • for Android developers there is a somewhat more convenient alternative: `android.net.Uri.encode()` – Alexander Malakhov Jan 23 '15 at 10:55
97

Please be warned that most of the answers above are INCORRECT.

The URLEncoder class, despite its name, is NOT what is needed here. It's unfortunate that Sun named this class so misleadingly. URLEncoder is meant for passing data as parameters, not for encoding the URL itself.

In other words, "http://search.barnesandnoble.com/booksearch/first book.pdf" is the URL. Parameters would be, for example, "http://search.barnesandnoble.com/booksearch/first book.pdf?parameter1=this&param2=that". The parameters are what you would use URLEncoder for.

The following two examples highlight the differences between the two.

The following produces the wrong parameters, according to the HTTP standard. Note the ampersand (&) and plus (+) are encoded incorrectly.

URI uri = new URI("http", null, "www.google.com", 80,
        "/help/me/book name+me/", "MY CRZY QUERY! +&+ :)", null);

// URI: http://www.google.com:80/help/me/book%20name+me/?MY%20CRZY%20QUERY!%20+&+%20:)

The following will produce the correct parameters, with the query properly encoded. Note the spaces, ampersands, and plus marks.

uri = new URI("http", null, "www.google.com", 80, "/help/me/book name+me/", URLEncoder.encode("MY CRZY QUERY! +&+ :)", "UTF-8"), null);

// URI: http://www.google.com:80/help/me/book%20name+me/?MY+CRZY+QUERY%2521+%252B%2526%252B+%253A%2529
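
As the second output shows, the multi-argument URI constructors always quote the percent character, so a query that has already been through URLEncoder comes out double-encoded (%21 becomes %2521). One way around this, sketched below, is to encode each parameter name and value with URLEncoder, join them yourself, and hand the finished string to the single-argument parser, which leaves existing escapes alone. The class name, method name, and the example.com URL are placeholders, not part of the original answer.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

final class QueryParamDemo {

    // Builds base + "?" + name=value with the name and value form-encoded.
    // Assumption: base is already a legal URI and has no query of its own.
    static URI withParam(String base, String name, String value) {
        try {
            String query = URLEncoder.encode(name, "UTF-8")
                    + "=" + URLEncoder.encode(value, "UTF-8");
            // URI.create uses the single-argument parser, which does not
            // re-quote '%', so the escapes survive unchanged.
            return URI.create(base + "?" + query);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(withParam("http://www.example.com/help", "q", "MY CRZY QUERY! +&+ :)"));
    }
}
```

Here the raw query should come out as q=MY+CRZY+QUERY%21+%2B%26%2B+%3A%29, with each escape applied exactly once.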
Lii
Matt
  • 2
    That's right, the URI constructor already encodes the querystring, according to the documentation http://docs.oracle.com/javase/1.4.2/docs/api/java/net/URI.html#URI(java.lang.String, java.lang.String, java.lang.String, int, java.lang.String, java.lang.String, java.lang.String) – madoke Oct 10 '12 at 14:17
  • 9
    @Draemon The answer is correct but uses the query string in an uncommon way; a more normal example might be `query = URLEncoder.encode(key) + "=" + URLEncoder.encode(value)`. The docs merely say that "any character that is not a legal URI character is quoted". – tc. Mar 13 '13 at 19:45
  • 1
    I agree with Matt here. If you type this URL: "http://www.google.com/help/me/book name+me/?MY CRZY QUERY! +&+ :)" in a browser, it automatically encodes the spaces but the "&" is used as query value separator and "+" are lost. – arcot Jan 30 '14 at 22:31
  • Unfortunately, this answer is _also_ wrong, because it double-encodes things. With the multi-param URI constructor, if you have slashes in your path, or '&' or '=' in your query params or values, you are either going to fail to encode these, or double encode them. – Scott Carey Aug 24 '20 at 12:52
91

I'm going to add one suggestion here aimed at Android users. You can do this, which avoids having to pull in any external libraries. Also, all the search/replace solutions suggested in some of the answers above are perilous and should be avoided.

Give this a try:

String urlStr = "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4";
URL url = new URL(urlStr);
URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
url = uri.toURL();

You can see that in this particular URL, I need to have those spaces encoded so that I can use it for a request.

This takes advantage of a couple of features available to you in Android classes. First, the URL class can break a URL into its proper components, so there is no need for you to do any string search/replace work. Second, this approach takes advantage of the URI class feature of properly escaping components when you construct a URI from components rather than from a single string.

The beauty of this approach is that you can take any valid url string and have it work without needing any special knowledge of it yourself.
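
A self-contained version of the snippet, for anyone who wants to try it outside Android. Treat it as a sketch: the class and method names are invented here, it relies only on java.net.URL and java.net.URI, and it assumes the input is not already percent-encoded, because the multi-argument URI constructor quotes every '%' it sees (so an existing %20 would become %2520). The URL(String) constructor is deprecated in newer JDK releases but still available.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlReassembler {

    // Splits the string with java.net.URL (which tolerates spaces) and
    // rebuilds it with java.net.URI (which escapes illegal characters).
    static String escapeIllegalCharacters(String urlStr) {
        try {
            URL url = new URL(urlStr);
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            return uri.toString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot escape: " + urlStr, e);
        }
    }

    public static void main(String[] args) {
        // Spaces in the path are escaped.
        System.out.println(escapeIllegalCharacters(
                "http://abc.dev.domain.com/0007AC/ads/800x480 15sec h.264.mp4"));
        // Caveat: an input that is already escaped gets escaped again.
        System.out.println(escapeIllegalCharacters("http://example.com/a%20b"));
    }
}
```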

Craig B
  • 6
    Nice approach, but I would like to point out that this code does not prevent _double encoding_, e.g. %20 got encoded into %2520. [Scott's answer](http://stackoverflow.com/a/9542781/554894) does not suffer from this. – nattster Aug 03 '14 at 08:12
  • Or if you just want to do path quoting: new URI(null, null, "/path with spaces", null, null).toString() – user1050755 Nov 09 '14 at 05:54
  • 1
    @Stallman If your file name contains #, the URL class will put it into "ref" (equivalent of "fragment" in the URI class). You can detect whether URL.getRef() returns something that might be treated as a part of the path and pass URL.getPath() + "#" + URL.getRef() as the "path" parameter and null as the "fragment" parameter of the URI class 7 parameters constructor. By default, the string after # is treated as a reference (or an anchor). – gouessej Jan 07 '16 at 13:59
  • great answer, i have simple urls and it works for me. Although i don't think its very android specific. I used `java.net.URI` and `java.net.URL` and this answer was working perfectly. I am even able to unit test this. – ansh sachdeva Aug 24 '20 at 02:12
49

A solution I developed, which is much more stable than any other:

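import java.nio.charset.StandardCharsets;

public class URLParamEncoder {

    public static String encode(String input) {
        StringBuilder resultStr = new StringBuilder();
        // Work on the UTF-8 bytes rather than the UTF-16 chars, so that
        // non-ASCII input yields valid two-digit escapes.
        for (byte b : input.getBytes(StandardCharsets.UTF_8)) {
            int value = b & 0xFF;
            if (isUnsafe(value)) {
                resultStr.append('%');
                resultStr.append(toHex(value / 16));
                resultStr.append(toHex(value % 16));
            } else {
                resultStr.append((char) value);
            }
        }
        return resultStr.toString();
    }

    // Converts a value in the range 0..15 to an uppercase hex digit.
    private static char toHex(int digit) {
        return (char) (digit < 10 ? '0' + digit : 'A' + digit - 10);
    }

    // Every non-ASCII byte and the listed ASCII characters get escaped.
    private static boolean isUnsafe(int value) {
        if (value > 127)
            return true;
        return " %$&+,/:;=?@<>#".indexOf(value) >= 0;
    }

}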
fmucar
  • 3
    that also requires you to break the url into pieces. There is no way for a computer to know which part of the url to encode. See my above edit – fmucar Aug 11 '11 at 16:34
  • 4
    @fmucar Thanks for that piece of code! It should be noted that this isn't UTF-8. To get UTF-8 just pre-process the input with `String utf8Input = new String(Charset.forName("UTF-8").encode(input).array());` (taken from [here](http://stackoverflow.com/questions/5729806/encode-string-to-utf-8/5729828#5729828)) – letmaik Oct 08 '11 at 19:44
  • Actually, I use it with a trim() and explicit encoding now although the latter is probably unnecessary: `new String(Charset.forName("UTF-8").encode(q).array(), "ISO-8859-1").trim();` The trim() is needed as encode() appends null bytes at the end which the String constructor doesn't remove. Don't know if it's fully correct, but works for me... – letmaik Oct 08 '11 at 21:40
  • 2
    This solution will actually also encode the "http://" part into "http%3A%2F%2F", which is what the initial question tried to avoid. – Benjamin Piette Jun 25 '13 at 08:21
  • 3
    You only pass what you need to encode, not the whole URL. There is no way to pass one whole URL string and expect correct encoding. In all cases, you need to break the url into its logical pieces. – fmucar Jun 25 '13 at 13:36
  • 2
    I had problems with this answer because it doesn't encode unsafe chars to UTF-8.. may be dependent on the peer application though. – Tarnschaf Oct 09 '13 at 12:07
  • This fails when a string having Chinese characters is passed as input: eg: "Test Sample-000001363/这是一个演示文件.docx" – mnagdev Jan 18 '22 at 07:13
40

If you have a URL, you can pass url.toString() into this method. First decode, to avoid double encoding (for example, encoding a space results in %20 and encoding a percent sign results in %25, so double encoding will turn a space into %2520). Then, use the URI as explained above, adding in all the parts of the URL (so that you don't drop the query parameters).

public URL convertToURLEscapingIllegalCharacters(String string){
    try {
        String decodedURL = URLDecoder.decode(string, "UTF-8");
        URL url = new URL(decodedURL);
        URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef()); 
        return uri.toURL(); 
    } catch (Exception ex) {
        ex.printStackTrace();
        return null;
    }
}
Scott Izu
  • 2
    URLDecoder.decode(string, "UTF-8") fails with an IllegalArgumentException when you pass the string as "https://www.google.co.in/search?q=123%!123". This is a valid URL. I guess this API doesn't work when % is used as data instead of the encoding character. – MediumOne May 28 '15 at 13:10
27

Yeah URL encoding is going to encode that string so that it would be passed properly in a url to a final destination. For example you could not have http://stackoverflow.com?url=http://yyy.com. UrlEncoding the parameter would fix that parameter value.

So I have two choices for you:

  1. Do you have access to the path separate from the domain? If so you may be able to simply UrlEncode the path. However, if this is not the case then option 2 may be for you.

  2. Get commons-httpclient-3.1. This has a class URIUtil:

    System.out.println(URIUtil.encodePath("http://example.com/x y", "ISO-8859-1"));

This will output exactly what you are looking for, as it will only encode the path part of the URI.

FYI, you'll need commons-codec and commons-logging for this method to work at runtime.

SW4
Nathan Feger
  • Sidenote apache commons stopped maintaining URIUtil in 4.x branches apparently, recommending you use JDK's URI class instead. Just means you have to break up the string yourself. – Nicholi Jul 23 '14 at 22:44
  • 2) Exactly it is also suggested here http://stackoverflow.com/questions/5330104/encoding-url-query-parameters-in-java I also used `URIUtil` solution – To Kra Feb 05 '16 at 08:48
13

If anybody doesn't want to add a dependency to their project, these functions may be helpful.

We pass the 'path' part of our URL into here. You probably don't want to pass the full URL in as a parameter (query strings need different escapes, etc).

/**
 * Percent-encodes a string so it's suitable for use in a URL Path (not a query string / form encode, which uses + for spaces, etc)
 */
public static String percentEncode(String encodeMe) {
    if (encodeMe == null) {
        return "";
    }
    String encoded = encodeMe.replace("%", "%25");
    encoded = encoded.replace(" ", "%20");
    encoded = encoded.replace("!", "%21");
    encoded = encoded.replace("#", "%23");
    encoded = encoded.replace("$", "%24");
    encoded = encoded.replace("&", "%26");
    encoded = encoded.replace("'", "%27");
    encoded = encoded.replace("(", "%28");
    encoded = encoded.replace(")", "%29");
    encoded = encoded.replace("*", "%2A");
    encoded = encoded.replace("+", "%2B");
    encoded = encoded.replace(",", "%2C");
    encoded = encoded.replace("/", "%2F");
    encoded = encoded.replace(":", "%3A");
    encoded = encoded.replace(";", "%3B");
    encoded = encoded.replace("=", "%3D");
    encoded = encoded.replace("?", "%3F");
    encoded = encoded.replace("@", "%40");
    encoded = encoded.replace("[", "%5B");
    encoded = encoded.replace("]", "%5D");
    return encoded;
}

/**
 * Percent-decodes a string, such as used in a URL Path (not a query string / form encode, which uses + for spaces, etc)
 */
public static String percentDecode(String encodeMe) {
    if (encodeMe == null) {
        return "";
    }
    String decoded = encodeMe.replace("%21", "!");
    decoded = decoded.replace("%20", " ");
    decoded = decoded.replace("%23", "#");
    decoded = decoded.replace("%24", "$");
    decoded = decoded.replace("%26", "&");
    decoded = decoded.replace("%27", "'");
    decoded = decoded.replace("%28", "(");
    decoded = decoded.replace("%29", ")");
    decoded = decoded.replace("%2A", "*");
    decoded = decoded.replace("%2B", "+");
    decoded = decoded.replace("%2C", ",");
    decoded = decoded.replace("%2F", "/");
    decoded = decoded.replace("%3A", ":");
    decoded = decoded.replace("%3B", ";");
    decoded = decoded.replace("%3D", "=");
    decoded = decoded.replace("%3F", "?");
    decoded = decoded.replace("%40", "@");
    decoded = decoded.replace("%5B", "[");
    decoded = decoded.replace("%5D", "]");
    decoded = decoded.replace("%25", "%");
    return decoded;
}

And tests:

@Test
public void testPercentEncode_Decode() {
    assertEquals("", percentDecode(percentEncode(null)));
    assertEquals("", percentDecode(percentEncode("")));

    assertEquals("!", percentDecode(percentEncode("!")));
    assertEquals("#", percentDecode(percentEncode("#")));
    assertEquals("$", percentDecode(percentEncode("$")));
    assertEquals("@", percentDecode(percentEncode("@")));
    assertEquals("&", percentDecode(percentEncode("&")));
    assertEquals("'", percentDecode(percentEncode("'")));
    assertEquals("(", percentDecode(percentEncode("(")));
    assertEquals(")", percentDecode(percentEncode(")")));
    assertEquals("*", percentDecode(percentEncode("*")));
    assertEquals("+", percentDecode(percentEncode("+")));
    assertEquals(",", percentDecode(percentEncode(",")));
    assertEquals("/", percentDecode(percentEncode("/")));
    assertEquals(":", percentDecode(percentEncode(":")));
    assertEquals(";", percentDecode(percentEncode(";")));

    assertEquals("=", percentDecode(percentEncode("=")));
    assertEquals("?", percentDecode(percentEncode("?")));
    assertEquals("@", percentDecode(percentEncode("@")));
    assertEquals("[", percentDecode(percentEncode("[")));
    assertEquals("]", percentDecode(percentEncode("]")));
    assertEquals(" ", percentDecode(percentEncode(" ")));

    // Get a little complex
    assertEquals("[]]", percentDecode(percentEncode("[]]")));
    assertEquals("a=d%*", percentDecode(percentEncode("a=d%*")));
    assertEquals(")  (", percentDecode(percentEncode(")  (")));
    assertEquals("%21%20%2A%20%27%20%28%20%25%20%29%20%3B%20%3A%20%40%20%26%20%3D%20%2B%20%24%20%2C%20%2F%20%3F%20%23%20%5B%20%5D%20%25",
                    percentEncode("! * ' ( % ) ; : @ & = + $ , / ? # [ ] %"));
    assertEquals("! * ' ( % ) ; : @ & = + $ , / ? # [ ] %", percentDecode(
                    "%21%20%2A%20%27%20%28%20%25%20%29%20%3B%20%3A%20%40%20%26%20%3D%20%2B%20%24%20%2C%20%2F%20%3F%20%23%20%5B%20%5D%20%25"));

    assertEquals("%23456", percentDecode(percentEncode("%23456")));

}
Cuga
11

Unfortunately, org.apache.commons.httpclient.util.URIUtil is deprecated, and the replacement org.apache.commons.codec.net.URLCodec does encoding suitable for form posts, not for actual URLs. So I had to write my own function, which encodes a single component (it is not suitable for entire query strings that contain ?'s and &'s):

public static String encodeURLComponent(final String s)
{
  if (s == null)
  {
    return "";
  }

  final StringBuilder sb = new StringBuilder();

  // Convert the whole string to UTF-8 first, so that surrogate pairs
  // (characters outside the BMP) are encoded correctly.
  for (final byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8))
  {
    final char c = (char) (b & 0xff);

    // RFC 3986 unreserved characters pass through unchanged.
    if (((c >= 'A') && (c <= 'Z')) || ((c >= 'a') && (c <= 'z')) ||
        ((c >= '0') && (c <= '9')) ||
        (c == '-') ||  (c == '.')  || (c == '_') || (c == '~'))
    {
      sb.append(c);
    }
    else
    {
      // Everything else becomes '%' followed by two uppercase hex digits.
      sb.append('%');
      sb.append(Character.toUpperCase(Character.forDigit((b >> 4) & 0xf, 16)));
      sb.append(Character.toUpperCase(Character.forDigit(b & 0xf, 16)));
    }
  }

  return sb.toString();
}
takrl
Jeff Tsay
10

URLEncoder can encode HTTP URLs just fine, as you've unfortunately discovered. The string you passed in, "http://search.barnesandnoble.com/booksearch/first book.pdf", was correctly and completely encoded into a URL-encoded form. You could pass that entire long string of gobbledygook that you got back as a parameter in a URL, and it could be decoded back into exactly the string you passed in.

It sounds like you want to do something a little different than passing the entire URL as a parameter. From what I gather, you're trying to create a search URL that looks like "http://search.barnesandnoble.com/booksearch/whateverTheUserPassesIn". The only thing that you need to encode is the "whateverTheUserPassesIn" bit, so perhaps all you need to do is something like this:

String url = "http://search.barnesandnoble.com/booksearch/" + 
       URLEncoder.encode(userInput,"UTF-8");

That should produce something rather more valid for you.
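
One caveat with the snippet above: URLEncoder produces form encoding, so a space in userInput comes out as "+" rather than the "%20" the question asks for. A common workaround, sketched here with a made-up helper name, is to replace "+" with "%20" afterwards; this is safe because URLEncoder has already turned any literal "+" into "%2B". URLEncoder also escapes "/", so the helper is only meant for a single path segment.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class PathSegmentDemo {

    // Form-encodes one path segment, then turns the '+' that URLEncoder
    // uses for spaces into %20.
    static String encodeSegment(String segment) {
        try {
            return URLEncoder.encode(segment, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        String url = "http://search.barnesandnoble.com/booksearch/"
                + encodeSegment("first book.pdf");
        System.out.println(url);
    }
}
```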

Brandon Yarbrough
8

I read the previous answers and wrote my own method, because I could not get any of the earlier solutions to work properly. It works well for me, but if you find a URL that does not work with this, please let me know.

public static URL convertToURLEscapingIllegalCharacters(String toEscape) throws MalformedURLException, URISyntaxException {
    URL url = new URL(toEscape);
    URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
    // If a % is included in the toEscape string, it will be re-encoded to %25; we don't want re-encoding, just encoding.
    return new URL(uri.toString().replace("%25", "%"));
}
Emilien Brigand
8

There is still a problem if you have got an encoded "/" (%2F) in your URL.

RFC 3986 - Section 2.2 says: "If data for a URI component would conflict with a reserved character's purpose as a delimiter, then the conflicting data must be percent-encoded before the URI is formed."

But there is an Issue with Tomcat:

http://tomcat.apache.org/security-6.html - Fixed in Apache Tomcat 6.0.10

important: Directory traversal CVE-2007-0450

Tomcat permits '\', '%2F' and '%5C' [...] .

The following Java system properties have been added to Tomcat to provide additional control of the handling of path delimiters in URLs (both options default to false):

  • org.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH: true|false
  • org.apache.catalina.connector.CoyoteAdapter.ALLOW_BACKSLASH: true|false

Due to the impossibility to guarantee that all URLs are handled by Tomcat as they are in proxy servers, Tomcat should always be secured as if no proxy restricting context access was used.

Affects: 6.0.0-6.0.9

So if you have a URL containing %2F, Tomcat returns: "400 Invalid URI: noSlash"

You can switch off the bugfix in the Tomcat startup script:

set JAVA_OPTS=%JAVA_OPTS% %LOGGING_CONFIG%   -Dorg.apache.tomcat.util.buf.UDecoder.ALLOW_ENCODED_SLASH=true 
simonox
5

You could try UriUtils in org.springframework.web.util:

UriUtils.encodeUri(input, "UTF-8")
micahli123
4

You can also use Guava and one of its URL escapers: UrlEscapers.urlFragmentEscaper().escape(relativePath)

To Kra
4

I agree with Matt. Indeed, I've never seen it well explained in tutorials, but one matter is how to encode the URL path, and a very different one is how to encode the parameters which are appended to the URL (the query part, behind the "?" symbol). They use similar encoding, but not the same.

This matters especially for the space character. The URL path needs it encoded as %20, whereas the query part allows both %20 and the "+" sign. The best idea is to test it ourselves against our Web server, using a Web browser.

For both cases, I ALWAYS would encode COMPONENT BY COMPONENT, never the whole string. Indeed URLEncoder allows that for the query part. For the path part you can use the class URI, although in this case it asks for the entire string, not a single component.

Anyway, I believe that the best way to avoid these problems is to use a personal, conflict-free design. How? For example, I would never name directories or parameters using characters other than a-z, A-Z, 0-9 and _ . That way, the only need is to encode the value of every parameter, since it may come from user input and the characters used are unknown.
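
To make the path/query difference concrete, here is a small sketch that encodes the same kind of text both ways using only JDK classes. The class and helper names are invented for this example; the path helper leans on the multi-argument java.net.URI constructor and the query helper on java.net.URLEncoder.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;

final class PathVersusQuery {

    // Path rules: a space becomes %20. The path is assumed to start with '/'.
    static String encodePath(String path) {
        try {
            return new URI(null, null, path, null).getRawPath();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Form (query value) rules: a space becomes '+'.
    static String encodeQueryValue(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodePath("/my docs/first book.pdf"));
        System.out.println(encodeQueryValue("first book"));
    }
}
```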

negora
3

I took the content above and changed it around a bit. I like positive logic first, and I thought a HashSet might give better performance than some other options, like searching through a String. Although, I'm not sure if the autoboxing penalty is worth it, but if the compiler optimizes for ASCII chars, then the cost of boxing will be low.

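/**
 * Replaces any character not specifically unreserved with an equivalent
 * percent sequence, computed from its UTF-8 bytes.
 * @param s the string to encode
 * @return the percent-encoded string
 */
public static String encodeURIcomponent(String s)
{
    StringBuilder o = new StringBuilder();
    // Iterate over UTF-8 bytes so that non-ASCII characters produce
    // valid two-digit escapes.
    for (byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
        char ch = (char) (b & 0xFF);
        if (isSafe(ch)) {
            o.append(ch);
        }
        else {
            o.append('%');
            o.append(toHex(ch / 16));
            o.append(toHex(ch % 16));
        }
    }
    return o.toString();
}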

private static char toHex(int ch)
{
    return (char)(ch < 10 ? '0' + ch : 'A' + ch - 10);
}

// https://tools.ietf.org/html/rfc3986#section-2.3
public static final HashSet<Character> UnreservedChars = new HashSet<Character>(Arrays.asList(
        'A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
        'a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z',
        '0','1','2','3','4','5','6','7','8','9',
        '-','_','.','~'));
public static boolean isSafe(char ch)
{
    return UnreservedChars.contains(ch);
}
ChrisG65
2

Use the following standard Java solution (it passes around 100 of the test cases provided by Web Platform Tests):

0. Test if URL is already encoded.

1. Split URL into structural parts. Use java.net.URL for it.

2. Encode each structural part properly!

3. Use IDN.toASCII(putDomainNameHere) to Punycode encode the host name!

4. Use java.net.URI.toASCIIString() to percent-encode, NFC encoded unicode - (better would be NFKC!).

Find more here: https://stackoverflow.com/a/49796882/1485527
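
Steps 1, 3 and 4 can be sketched with JDK classes alone. The following is an illustration rather than the linked implementation: the class and method names are invented, step 0 (detecting input that is already encoded) is skipped, and the host bücher.example is just a sample internationalized name.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UnicodeUrlEncoder {

    static String toAsciiUrl(String urlStr) {
        try {
            // Step 1: split the URL into its structural parts.
            URL url = new URL(urlStr);
            // Step 3: Punycode-encode the host name.
            String host = IDN.toASCII(url.getHost());
            // Step 2: the multi-argument constructor escapes each part.
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), host,
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            // Step 4: percent-encode the remaining non-ASCII characters.
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode: " + urlStr, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toAsciiUrl("http://b\u00fccher.example/caf\u00e9 menu.pdf"));
    }
}
```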

jschnasse
2

If you are using spring, you can try org.springframework.web.util.UriUtils#encodePath

Nick Allen
2

In addition to Carlos Heuberger's reply: if a port other than the default (80) is needed, the 7-parameter constructor should be used:

URI uri = new URI(
        "http",
        null, // this is for userInfo
        "www.google.com",
        8080, // port number as int
        "/ig/api",
        "weather=São Paulo",
        null);
String request = uri.toASCIIString();
Martin Dimitrov
0

I had the same problem. I solved it by using:

android.net.Uri.encode(urlString, ":/");

It encodes the string but skips ":" and "/".

Richard R
0

I've created a new project to help construct HTTP URLs. The library will automatically URL encode path segments and query parameters.

You can view the source and download a binary at https://github.com/Widen/urlbuilder

The example URL in this question:

new UrlBuilder("search.barnesandnoble.com", "booksearch/first book.pdf").toString()

produces

http://search.barnesandnoble.com/booksearch/first%20book.pdf

Uriah Carpenter
-1

I develop a library that serves this purpose: galimatias. It parses URLs the same way web browsers do. That is, if a URL works in a browser, it will be correctly parsed by galimatias.

In this case:

// Parse
io.mola.galimatias.URL.parse(
    "http://search.barnesandnoble.com/booksearch/first book.pdf"
).toString()

Will give you: http://search.barnesandnoble.com/booksearch/first%20book.pdf. Of course this is the simplest case, but it'll work with anything, way beyond java.net.URI.

You can check it out at: https://github.com/smola/galimatias

smola
  • I'm not sure why this answer was downvoted so much. This library albeit a bit big in footprint does exactly what I need. – Lakatos Gyula Nov 23 '21 at 23:43
-2

I use this:

org.apache.commons.text.StringEscapeUtils.escapeHtml4("my text % & < >");

Add this dependency:

<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-text</artifactId>
    <version>1.8</version>
</dependency>
developer learn999
-3

You can use a function like this. Complete and modify it to suit your needs:

/**
     * Encode URL (except :, /, ?, &, =, ... characters)
     * @param url to encode
     * @param encodingCharset url encoding charset
     * @return encoded URL
     * @throws UnsupportedEncodingException
     */
    public static String encodeUrl (String url, String encodingCharset) throws UnsupportedEncodingException{
            return new URLCodec().encode(url, encodingCharset).replace("%3A", ":").replace("%2F", "/").replace("%3F", "?").replace("%3D", "=").replace("%26", "&");
    }

Example of use:

String urlToEncode = "http://www.growup.com/folder/intérieur-à_vendre?o=4";
Utils.encodeUrl(urlToEncode, "UTF-8");

The result is : http://www.growup.com/folder/int%C3%A9rieur-%C3%A0_vendre?o=4

Salim Hamidi
-7

How about:

public String UrlEncode(String in_) {
    String retVal = "";

    try {
        retVal = URLEncoder.encode(in_, "UTF8");
    } catch (UnsupportedEncodingException ex) {
        Log.get().exception(Log.Level.Error, "urlEncode ", ex);
    }

    return retVal;
}

Kladskull