
I want to know if there is any way to convert a URL like this:

https://www.mysite.com/lot/of/unpleasant/folders/and/my/url with spaces &"others".xls

into

https://www.mysite.com/lot/of/unpleasant/folders/and/my/url%20with%20spaces%20&%22others%22.xls

This is similar to the URL rewriting that Firefox does: paste the former URL into the address bar, send it to the server (you get no response unless you actually have a site like this), then copy the URL from the address bar and paste it somewhere else.

Using URLEncoder#encode gives me this (undesired) output:

https%3A%2F%2Fwww.mysite.com%2Flot%2Fof%2Funpleasant%2Ffolders%2Fand%2Fmy%2Furl+with+spaces+%26%22others%22.xls

Sadly, I receive the whole URL as a single String, as shown at the beginning of the question, so using URLEncoder#encode directly doesn't work.

I naively tried this:

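String evilUrl = "https://www.mysite.com/lot/of/unpleasant/folders/and/my/url with spaces &\"others\".xls";
// Split off the scheme, then break the remainder into host and path segments
String[] urlParts = evilUrl.split("://");
String scheme = urlParts[0];
urlParts = urlParts[1].split("/");
String host = urlParts[0];
// Start empty: new StringBuilder('/') would treat the char as an int capacity, not as initial content
StringBuilder sb = new StringBuilder();
for (int i = 1; i < urlParts.length; i++) {
    sb.append('/');
    sb.append(urlParts[i]);
}
// The multi-argument URI constructor quotes illegal characters in the path (it may throw URISyntaxException)
URI uri = new URI(scheme, host, sb.toString(), null);
System.out.println(uri.toASCIIString());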

This gives the following (better) output:

https://www.mysite.com/lot/of/unpleasant/folders/and/my/url%20with%20spaces%20&%22others%22.xls

But I'm not sure whether there is an out-of-the-box solution for this problem and I'm racking my brain for nothing, or whether I can rely on this piece of code to solve my problem in most cases.
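One way to let the JDK do the splitting instead of `String#split` is sketched below. It assumes the input always has a scheme that `java.net.URL` recognizes, and the class and method names (`LenientUrlEncoder`, `encode`) are made up for illustration. `java.net.URL` splits the string into components without rejecting spaces or quotes, and the multi-argument `URI` constructor then quotes the illegal characters. Note that the `URL(String)` constructor is deprecated in recent JDK releases, so this may produce a compiler warning.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class LenientUrlEncoder {
    /** Hypothetical helper: percent-encodes the illegal characters of a raw URL string. */
    static String encode(String rawUrl) {
        try {
            // URL only splits the string into components; it does not validate the characters
            URL url = new URL(rawUrl);
            // The multi-argument URI constructor quotes characters that are illegal in each component
            URI uri = new URI(url.getProtocol(), url.getUserInfo(), url.getHost(),
                    url.getPort(), url.getPath(), url.getQuery(), url.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode: " + rawUrl, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encode(
                "https://www.mysite.com/lot/of/unpleasant/folders/and/my/url with spaces &\"others\".xls"));
    }
}
```

One caveat: these `URI` constructors always quote the percent character, so an input that already contains `%20` would come out as `%2520`. The sketch therefore only suits input that is not encoded at all.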


By the way, I already visited some resources on this topic:

Luiggi Mendoza
  • Does `new URL(...)` do the correct thing on your test cases? If it parses those spaces and special characters correctly, you should be able to regenerate it with escaped versions. But be warned, there's no way to tell what someone really means when there's a question mark in their URL unless you can just assume there's no query string. – David Ehrmann Apr 02 '14 at 23:41

1 Answer


The problem with that sort of URL is that it is partially encoded. An out-of-the-box encoder will always encode the whole string, so I guess your approach of using a custom encoder is correct. Your code is OK; you would just need to add some validations. For instance, what happens if the "evil URL" doesn't come with the protocol part (i.e. without the "https://")? You can skip that check only if you're pretty sure it will never happen.
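As a minimal sketch of that kind of validation, the check below tests whether the string starts with a scheme followed by "://" before any splitting is attempted. The class and method names (`SchemeCheck`, `hasScheme`) are hypothetical, and what to do when the scheme is missing (reject the input, or prepend a default) is left open.

```java
class SchemeCheck {
    /** Hypothetical helper: true if the string starts with a syntactically valid scheme and "://". */
    static boolean hasScheme(String url) {
        int idx = url.indexOf("://");
        // A scheme starts with a letter, followed by letters, digits, '+', '.' or '-'
        return idx > 0 && url.substring(0, idx).matches("[A-Za-z][A-Za-z0-9+.-]*");
    }

    public static void main(String[] args) {
        System.out.println(hasScheme("https://www.mysite.com/my/url with spaces.xls"));
        System.out.println(hasScheme("www.mysite.com/my/url with spaces.xls"));
    }
}
```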

I had some spare time, so I wrote an alternative custom encoder. The strategy is to scan for chars that are not allowed in a URL and encode only those, rather than trying to re-encode the whole thing:

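private static String encodeSemiEncoded(String semiEncodedUrl) {
    // Reserved and unreserved symbols that are copied through unchanged
    final String ALLOWED_CHAR = "!*'();:@&=+$,/?#[]-_.~";
    StringBuilder encoded = new StringBuilder();
    for (char ch : semiEncodedUrl.toCharArray()) {
        // Encode anything that is neither an allowed symbol nor a letter/digit, plus every non-ASCII char
        boolean shouldEncode = ALLOWED_CHAR.indexOf(ch) == -1 && !Character.isLetterOrDigit(ch) || ch > 127;
        if (shouldEncode) {
            // Writes the char's numeric value as hex; this is only a standard escape for ASCII chars,
            // non-ASCII chars would need their UTF-8 bytes encoded one by one
            encoded.append(String.format("%%%02X", (int) ch));
        } else {
            encoded.append(ch);
        }
    }
    return encoded.toString();
}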
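Two cases are worth handling if the input can contain them: an existing escape such as `%20` (the `%` is not in ALLOWED_CHAR, so it would be encoded again as `%2520`) and non-ASCII characters, which are conventionally escaped as UTF-8 bytes. The following is a sketch of a variant that covers both, not a drop-in replacement; the class name `SemiEncodedUrlEncoder` and the helper `isHex` are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class SemiEncodedUrlEncoder {
    private static final String ALLOWED_CHAR = "!*'();:@&=+$,/?#[]-_.~";

    static String encodeSemiEncoded(String url) {
        StringBuilder encoded = new StringBuilder();
        int i = 0;
        while (i < url.length()) {
            int cp = url.codePointAt(i);
            if (cp == '%' && isHex(url, i + 1) && isHex(url, i + 2)) {
                // Already a percent escape: copy all three chars through unchanged
                encoded.append(url, i, i + 3);
                i += 3;
                continue;
            }
            boolean asciiAllowed = cp < 128
                    && (Character.isLetterOrDigit(cp) || ALLOWED_CHAR.indexOf(cp) != -1);
            if (asciiAllowed) {
                encoded.append((char) cp);
            } else {
                // Percent-encode every UTF-8 byte of the code point
                byte[] bytes = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    encoded.append(String.format("%%%02X", b & 0xFF));
                }
            }
            i += Character.charCount(cp);
        }
        return encoded.toString();
    }

    /** True if the char at index exists and is an ASCII hex digit. */
    private static boolean isHex(String s, int index) {
        if (index >= s.length()) {
            return false;
        }
        char c = s.charAt(index);
        return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
    }
}
```

Iterating by code point rather than by `char` keeps surrogate pairs together, so characters outside the Basic Multilingual Plane are encoded as one UTF-8 sequence.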

Hope this helps

morgano
  • Interesting way to use `String#format`. And works as expected. I just wonder how the white space activates the `shouldEncode` flag. – Luiggi Mendoza Apr 03 '14 at 00:44
  • It is just because white space is not in ALLOWED_CHAR and is not a letter or number (and ' ' = 32, which is less than 127). The extra check (ch > 127) is there so that letters outside the ASCII set (like the Spanish 'Ñ') get encoded too. – morgano Apr 03 '14 at 01:01
  • Oh ok, I think now I understand this algorithm. Pretty clever. – Luiggi Mendoza Apr 03 '14 at 01:08