What happens here?
As stated by @yelliver, the webserver seems to use NFD-normalized Unicode in its path names. So the solution is to use the same normalization form when building the request URL.
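To make the difference concrete, here is a minimal sketch of how the same visible path segment percent-encodes differently under NFC and NFD. The class name `NormalizationDemo`, the helper `encodeSegment`, and the sample segment are made up for illustration; only `java.text.Normalizer` is the real JDK API.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

public class NormalizationDemo {

    // Hypothetical helper: normalizes one path segment to the given form,
    // then percent-encodes every UTF-8 byte that is not an unreserved ASCII character.
    public static String encodeSegment(String segment, Normalizer.Form form) {
        String normalized = Normalizer.normalize(segment, form);
        StringBuilder out = new StringBuilder();
        for (byte b : normalized.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append(String.format("%%%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String segment = "caf\u00e9"; // "café" with a precomposed é (U+00E9)

        // NFC keeps é as one code point: caf%C3%A9
        System.out.println(encodeSegment(segment, Normalizer.Form.NFC));

        // NFD splits it into 'e' plus combining acute accent (U+0301): cafe%CC%81
        System.out.println(encodeSegment(segment, Normalizer.Form.NFD));
    }
}
```

A server that stores its file names in NFD will only match the second form, which would explain why a request built from the NFC form can end in a 404.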
Is the webserver behaving correctly?
1. For those who are curious (like me), this article on Multilingual Web Addresses sheds some light on the subject. In the section on IRI paths (the part that is actually handled by the webserver), it states:
Whereas the domain registration authorities can all agree to accept domain names in a particular form and encoding (ASCII-based punycode), multi-script path names identify resources located on many kinds of platforms, whose file systems do and will continue to use many different encodings. This makes the path much more difficult to handle than the domain name.
2. More on how to encode paths can be found in Section 5.3.2.2 of the IETF Proposed Standard on Internationalized Resource Identifiers (IRIs), RFC 3987. It says:
Equivalence of IRIs MUST rely on the assumption that IRIs are
appropriately pre-character-normalized rather than apply character
normalization when comparing two IRIs. The exceptions are conversion
from a non-digital form, and conversion from a non-UCS-based
character encoding to a UCS-based character encoding. In these cases,
NFC or a normalizing transcoder using NFC MUST be used for
interoperability. To avoid false negatives and problems with
transcoding, IRIs SHOULD be created by using NFC. Using NFKC may
avoid even more problems; for example, by choosing half-width Latin
letters instead of full-width ones, and full-width instead of
half-width Katakana.
3. The Unicode Consortium states:
NFKC is the preferred form for identifiers, especially where there are security concerns (see UTR #36). NFD and NFKD are most useful for internal processing.
Conclusion
The webserver mentioned in the question does not conform to the recommendations of the IRI standard or the Unicode Consortium: it uses NFD normalization instead of NFC or NFKC.
One way to encode a URL string correctly is as follows:
URI uri = new URI(url.getProtocol(), url.getUserInfo(), IDN.toASCII(url.getHost()), url.getPort(), url.getPath(), url.getQuery(), url.getRef());
Then convert that URI to an ASCII string:
String correctEncodedURL=uri.toASCIIString();
toASCIIString() internally calls encode(), which normalizes the string to NFC before percent-encoding its UTF-8 bytes. IDN.toASCII() converts the host name to Punycode.
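Put together, the two steps above might look like the following self-contained sketch. The class name `UrlEncodingSketch`, the method `toAsciiUrl`, and the sample URL are assumptions for illustration; the `URL`, `URI`, and `IDN` calls are the same ones used above. Note that `new URL(String)` is deprecated in recent JDKs but still works.

```java
import java.net.IDN;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

public class UrlEncodingSketch {

    // Hypothetical helper: splits the raw URL into components, converts the host
    // to Punycode, and lets URI.toASCIIString() NFC-normalize and percent-encode the rest.
    public static String toAsciiUrl(String raw)
            throws MalformedURLException, URISyntaxException {
        URL url = new URL(raw);
        URI uri = new URI(url.getProtocol(), url.getUserInfo(),
                IDN.toASCII(url.getHost()), url.getPort(),
                url.getPath(), url.getQuery(), url.getRef());
        return uri.toASCIIString();
    }

    public static void main(String[] args) throws Exception {
        // Host contains ü (U+00FC); the path contains 'e' + combining acute (NFD form of é).
        String raw = "http://b\u00fccher.example/cafe\u0301/menu.html";

        // prints http://xn--bcher-kva.example/caf%C3%A9/menu.html
        System.out.println(toAsciiUrl(raw));
    }
}
```

Because the result is always NFC, this produces the form the standards recommend, not necessarily the form a server with NFD path names expects; for such a server the path would have to be normalized to NFD and percent-encoded separately.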