Apparently brackets are not allowed in URI paths.

I'm not sure whether this is a Tomcat problem, but I'm getting requests with paths that contain `]`.

In other words:

request.getRequestURL() == "http://localhost:8080/a]b"
request.getRequestURI() == "/a]b"

BTW, getRequestURL() and getRequestURI() are generally escaped, i.e. for http://localhost:8080/a b:

request.getRequestURL() == "http://localhost:8080/a%20b"

So if you try to do:

new URI("http://localhost:8080/a]b")
new URI(request.getRequestURL().toString())

both fail with a URISyntaxException. If I escape the path myself, the existing %20 gets double-escaped (it becomes %2520).
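To make the two failure modes concrete, here is a minimal sketch that uses only `java.net.URI`. The class and helper names (`BracketUriDemo`, `parses`, `viaComponents`) are made up for illustration, and the multi-argument constructor merely stands in for "escaping the path", since it always quotes `%`.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class; not part of the servlet API or of Tomcat.
final class BracketUriDemo {

    /** Returns true if java.net.URI accepts the string as-is. */
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Builds a URI from components; this constructor quotes illegal characters and every '%'. */
    static String viaComponents(String path) {
        try {
            return new URI("http", null, "localhost", 8080, path, null, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parses("http://localhost:8080/a]b"));   // false: ']' is illegal in a path
        System.out.println(parses("http://localhost:8080/a%20b")); // true: already escaped
        System.out.println(viaComponents("/a]b"));    // http://localhost:8080/a%5Db
        System.out.println(viaComponents("/a%20b"));  // http://localhost:8080/a%2520b (double-escaped)
    }
}
```

So parsing the raw string rejects the bracket, while escaping by component fixes the bracket but damages the escapes that are already there.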

How do I turn Servlet Request URLs into URIs?

Adam Gent

1 Answer

Java's URI appears to be very strict and requires escaping of the excluded US-ASCII characters (RFC 2396, section 2.4.3).

To fix this, I encode those and only those characters, except '%' and '#', since the URL may already contain them. I used Commons HttpClient's URIUtil, which for some reason is not in HttpComponents.

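// Needs commons-httpclient 3.x on the classpath; unqualified URI below is java.net.URI.
import java.net.URI;
import java.net.URISyntaxException;
import java.util.BitSet;
import org.apache.commons.httpclient.URIException;
import org.apache.commons.httpclient.util.URIUtil;

// Despite the name, this is the set of characters that URIUtil.encode leaves unescaped.
private static final BitSet badUriChars = new BitSet(256);
static {
    badUriChars.set(0, 256, true);  // toIndex is exclusive; start with every byte value allowed
    // Clear the excluded characters so that they get percent-encoded.
    badUriChars.andNot(org.apache.commons.httpclient.URI.unwise);
    badUriChars.andNot(org.apache.commons.httpclient.URI.space);
    badUriChars.andNot(org.apache.commons.httpclient.URI.control);
    badUriChars.set('<', false);
    badUriChars.set('>', false);
    badUriChars.set('"', false);
    // '%' and '#' stay allowed, so existing escapes and fragments are left alone.
}

public static URI toURIorFail(String url) throws URISyntaxException, URIException {
    // URIUtil.encode returns a String, not a URI.
    String encoded = URIUtil.encode(url, badUriChars, "UTF-8");
    return new URI(encoded);
}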
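If pulling in Commons HttpClient 3.x is not an option, the same idea can be sketched with the JDK alone. This is an illustration under assumptions rather than a drop-in replacement: the names `LenientUri`, `escapeExcluded` and `toUri` are made up, the excluded set is typed out by hand from RFC 2396 section 2.4.3, and unlike the snippet above it also percent-encodes non-ASCII bytes.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper; names are illustrative only.
final class LenientUri {

    // RFC 2396 excluded US-ASCII characters (delims, unwise, space),
    // deliberately without '%' and '#' so existing escapes and fragments survive.
    private static final String EXCLUDED = "<>\"{}|\\^[]` ";

    /** Percent-encodes excluded, control and non-ASCII bytes; leaves everything else untouched. */
    static String escapeExcluded(String url) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        for (byte b : url.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean control = c < 0x20 || c == 0x7F;
            if (control || c > 0x7F || EXCLUDED.indexOf(c) >= 0) {
                out.append('%').append(String.format("%02X", c));
            } else {
                out.append((char) c);
            }
        }
        return out.toString();
    }

    /** Escapes the excluded characters, then hands the result to java.net.URI. */
    static URI toUri(String url) throws URISyntaxException {
        return new URI(escapeExcluded(url));
    }
}
```

With this, `toUri(request.getRequestURL().toString())` should turn `/a]b` into `/a%5Db` and leave `/a%20b` alone. One caveat of escaping every bracket: an IPv6 literal host such as `http://[::1]/` would be mangled too, so a real implementation would have to treat the authority part separately.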

Edit: Here are some related SO posts (more to come):

Adam Gent
  • `URI` is not so much strict as correct. `URL` is somewhat less correct. – Tom Hawtin - tackline Jun 14 '12 at 22:22
  • I agree. IMHO it appears `URL` should not really be considered a true subset of `URI`. – Adam Gent Jun 14 '12 at 22:24
  • `java.net.URL` is just broken. – Tom Hawtin - tackline Jun 14 '12 at 22:40
  • I don't think it's java.net.URL's fault for this one (although it is definitely broken for other things). The reason is that almost all user agents (browsers) are happy to send "[" or "]" without escaping it, but if you try to put a space in a URL the user agent will either auto-escape it or fail. – Adam Gent Jun 15 '12 at 15:10
  • RFC2396 has been obsoleted by RFC3986. Square brackets are now allowed, but only to delimit IPv6 addresses. "A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax." http://www.rfc-editor.org/rfc/rfc3986.txt – Erick G. Hagstrom Jan 07 '14 at 17:59