System.out.println(
    new URI("http", "example.com", "/servlet", "a=x%20y", null));

The result is http://example.com/servlet?a=x%2520y, where the query parameter value differs from the supplied one. Strange, but this does follow the Javadoc:

"The percent character ('%') is always quoted by these constructors."

We can pass the decoded string, a=x y, and then we get a reasonable(?) result: a=x%20y.
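To make the two cases easy to compare, here is a minimal sketch that runs both inputs through the same five-argument constructor. The class name PercentQuotingDemo and the helper build are made up for illustration; only java.net.URI itself is the real API.

```java
import java.net.URI;
import java.net.URISyntaxException;

class PercentQuotingDemo {
    // Hypothetical helper: builds the URI with the five-argument constructor
    // (scheme, authority, path, query, fragment) and returns its string form.
    static String build(String query) {
        try {
            return new URI("http", "example.com", "/servlet", query, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Already-escaped input: the '%' is quoted again.
        System.out.println(build("a=x%20y")); // http://example.com/servlet?a=x%2520y
        // Decoded input: the space is quoted once.
        System.out.println(build("a=x y"));   // http://example.com/servlet?a=x%20y
    }
}
```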

But what if the query parameter value contains an "&" character? This happens, for example, if the value is itself a URL with query parameters. Look at this (wrong) query string: a=b&c. The ampersand must be escaped here (a=b%26c), otherwise this can be read as a query parameter a=b followed by some garbage (c). If I pass the escaped form to a URI constructor, it encodes the percent sign again and returns a wrong URL: ...?a=b%2526c
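A small sketch of the ampersand case, looking at the raw query that the constructor ends up with. AmpersandQuotingDemo and rawQuery are illustrative names, not part of any library.

```java
import java.net.URI;
import java.net.URISyntaxException;

class AmpersandQuotingDemo {
    // Hypothetical helper: returns the raw (still encoded) query that the
    // five-argument constructor produced for the given query argument.
    static String rawQuery(String query) {
        try {
            return new URI("http", "example.com", "/servlet", query, null).getRawQuery();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // '&' is a legal query character, so it is passed through untouched.
        System.out.println(rawQuery("a=b&c"));   // a=b&c
        // Escaping it beforehand does not help: the '%' itself gets quoted.
        System.out.println(rawQuery("a=b%26c")); // a=b%2526c
    }
}
```

So with this constructor there seems to be no input that yields a=b%26c as the raw query.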

This issue seems to render java.net.URI useless. Am I missing something here?

Summary of answers

java.net.URI does know about the existence of the query part of a URI, but it does not understand the internals of the query part, which can differ for each scheme. For example, java.net.URI does not understand the internal structure of the HTTP query part. This would not be a problem if java.net.URI treated the query as an opaque string and did not alter it. But it tries to apply a generic percent-encoding algorithm, which breaks HTTP URLs.

Therefore I cannot use the URI class to reliably assemble a URL from its parts, even though it has constructors for that. I would also mention that, as of Java 7, the implementation of the relativize operation is quite limited: it only works if one URL is a prefix of the other. These two features (and the class's leaner interface for these purposes) were the reason I was interested in java.net.URI, but neither of them works for me.

In the end I used java.net.URL for parsing, and wrote my own code to assemble a URL from parts and to relativize two URLs. I also checked the Apache HttpClient URIBuilder class: although it does understand the internals of an HTTP query string, as of 4.3 it has the same encoding problem as java.net.URI when dealing with the query part as a whole.
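The hand-written assembly code is not shown here, so the following is only a rough sketch of one way to do it, not the code actually used. It encodes each parameter name and value separately with URLEncoder and then hands the finished string to the single-argument parser, which leaves existing escapes alone. HttpUrlAssembler, encode and assemble are hypothetical names. Note that URLEncoder does form encoding, so a space becomes "+" rather than "%20".

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

class HttpUrlAssembler {
    // Form-encodes a single parameter name or value as UTF-8.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    // Appends the parameters to a base URL that has no query part yet.
    static URI assemble(String base, Map<String, String> params) {
        StringBuilder sb = new StringBuilder(base);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&';
        }
        // URI.create parses the string and does not re-quote existing escapes.
        return URI.create(sb.toString());
    }

    public static void main(String[] args) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("return", "mypage?hello=world");
        System.out.println(assemble("http://example.com/some", params));
        // http://example.com/some?return=mypage%3Fhello%3Dworld
    }
}
```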

Hontvári Levente

4 Answers

1

The query string

a=b&c

is not wrong in a URI. The RFC on URI Generic Syntax states

The query component is a string of information to be interpreted by the resource.

  query         = *uric

Within a query component, the characters ";", "/", "?", ":", "@",
"&", "=", "+", ",", and "$" are reserved.

The character & in the query string is very much valid (uric represents reserved, mark, and alphanumeric characters). The RFC also states

Many URI include components consisting of or delimited by, certain
special characters. These characters are called "reserved", since
their usage within the URI component is limited to their reserved
purpose. If the data for a URI component would conflict with the
reserved purpose, then the conflicting data must be escaped before
forming the URI.

Because the & is valid but reserved, it is up to the user to determine if it is meant to be encoded or not.

What you call a query parameter is not a feature of a URI and therefore the URI class has no reason to (and shouldn't) support it.

Sotirios Delimanolis
  • Yes, `a=b&c` is syntactically valid, but it does not mean what is obviously intended: a query parameter named `a` with the value `b&c`. The ampersand must be escaped, but then URI returns a messed-up URL in toString(). Let's look at a more realistic example: we pass a relative URL `mypage?hello=world` in the `return` parameter. The full, valid URL is: `http://example.com/some?return=mypage%3Fhello%3Dworld`. What should I pass to the java.net.URI multi-argument constructors to get back this full URL? – Hontvári Levente Nov 11 '13 at 23:10
  • @HontváriJózsefLevente Query parameters are relevant in an HTTP context. But URI is not only relevant in an HTTP context. Query parameters are interpreted by an HTTP server. In a URI they mean nothing and you'll therefore not be able to do any special formatting with the `URI` class. – Sotirios Delimanolis Nov 11 '13 at 23:41
  • It is not necessary for java.net.URI to understand the internals of the query part. For example, it would be enough if its multi-argument constructors did not alter the perfectly valid query string I pass to them. – Hontvári Levente Nov 12 '13 at 00:22
  • @HontváriJózsefLevente Which perfectly valid query string did you pass to it and it changed it? `a=x%20y` is not a valid query string. Note that the RFC states `Under normal circumstances, the only time when octets within a URI are percent-encoded is during the process of producing the URI from its component parts`. So the `a=x%20y` becomes `a=x%2520y`. The javadoc states that, _aside from some minor deviations_, a `java.net.URI` instance represents a URI reference. – Sotirios Delimanolis Nov 12 '13 at 00:33
  • 1
    `&` can be both a separator within the query component and a data character. In the latter case it must be percent-encoded. Because URI does not understand the internals of the query component, it cannot decide whether the ampersand is a separator or a data character. Therefore, as you wrote, it is up to the user, i.e. my code, to decide which. Now if I percent-encode the ampersands that are data characters rather than separators, then URI.toString() returns a bad string. I still do not know what I should pass to the URI multi-argument constructors to get back the example URL I wrote above. – Hontvári Levente Nov 12 '13 at 01:25
  • @HontváriJózsefLevente It must only be encoded in the context of an HTTP request. The `URI` class doesn't know in which context you want to use it so it doesn't encode it, because that is not its job. You cannot use the `URI` constructor to do what you want. – Sotirios Delimanolis Nov 12 '13 at 01:27
1

The only workaround I found was to use the single-argument constructors and methods. Note that you must use URI#getRawQuery() to avoid decoding %26. For example:

URI uri = new URI("http://a/?b=c%26d&e");
// uri.getRawQuery() equals "b=c%26d&e"

uri = new URI(new URI(uri.getScheme(), uri.getAuthority(),
        uri.getPath(), null, null) + "?f=g%26h&i");
// uri.getRawQuery() equals "f=g%26h&i"

uri = uri.resolve("?j=k%26l&m");
// uri.getRawQuery() equals "j=k%26l&m"
// uri.toString() equals "http://a/?j=k%26l&m"
Matthew
0

The only working solution I know of is reflection (see https://blog.stackhunter.com/2014/03/31/encode-special-characters-java-net-uri/):

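import java.lang.reflect.Field;
import java.net.URI;

// encodedQueryString is assumed to be an already percent-encoded query, e.g. "a=b%26c"
URI uri = new URI("http", null, "example.com", -1, "/accounts", null, null);
// overwrite the private query field directly, bypassing the constructor's quoting
Field field = URI.class.getDeclaredField("query");
field.setAccessible(true);
field.set(uri, encodedQueryString);
// clear the cached string representation so that toString() rebuilds it
field = URI.class.getDeclaredField("string");
field.setAccessible(true);
field.set(uri, null);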
-1

Use the URLEncoder.encode() method. In your case, for example:

URLEncoder.encode("a=x%20y", "ISO-8859-1");
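For reference, a small sketch of what that call produces. Applied to the whole query string it also encodes the "=" separator and the existing "%", so it is probably better suited to individual parameter values. UrlEncoderScopeDemo and enc are illustrative names only.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

class UrlEncoderScopeDemo {
    // Hypothetical helper wrapping URLEncoder.encode with the charset used above.
    static String enc(String s) {
        try {
            return URLEncoder.encode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // ISO-8859-1 is always supported
        }
    }

    public static void main(String[] args) {
        // Whole query string: '=' and '%' are both encoded.
        System.out.println(enc("a=x%20y")); // a%3Dx%2520y
        // Value only: the space becomes '+' (form encoding).
        System.out.println("a=" + enc("x y")); // a=x+y
    }
}
```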
Eel Lee