
The following code line:

    URI url = new URI("http://host?xyz=abc%u021B");

gives the error:

java.net.URISyntaxException: Malformed escape pair at index 19: http://host?xyz=abc%u021B

The reason is the presence of %u021B, a non-standard encoding of a Unicode character.

Is there a standard way to handle this?

  • This isn't a URI, and `java.net.URI` won't work with it directly. What result are you after? – Joe Jun 11 '20 at 12:15
  • @Joe Why do you think it is not a URI? –  Jun 11 '20 at 12:24
  • By definition, I don't think you can have a standard solution for a non-standard issue. – Simon G. Jun 11 '20 at 14:01
  • Perhaps "not a URI" is too broad, but it's not a URI in the [RFC 3986 sense](https://tools.ietf.org/html/rfc3986#appendix-A). Either transform it into one, or process it without using standard libraries. – Joe Jun 11 '20 at 14:44

3 Answers


Is there a standard way to handle this?

Following RFC 3986, this isn't a valid URI, and the correct behaviour would be to reject it.

The WHATWG living standard suggests a more robust behaviour of treating the characters literally:

Otherwise, if byte is 0x25 (%) and the next two bytes after byte in input are not in the ranges 0x30 (0) to 0x39 (9), 0x41 (A) to 0x46 (F), and 0x61 (a) to 0x66 (f), all inclusive, append byte to output.

Here the two bytes after the % are u and 0, which are not both hex digits, so percent-decoding doesn't apply and the % is appended as-is, meaning that:

%u021B

is treated the same as:

%25u021B
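To make the lenient rule concrete, here is a minimal sketch of a percent-decoder that follows the quoted step. The class name `LenientPercentDecoder` and its `decode` method are hypothetical names chosen for illustration, not part of any library, and the sketch covers only decoding, not full WHATWG URL parsing.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class LenientPercentDecoder {
    /** Decodes %XX escapes; a % not followed by two hex digits is kept literally. */
    public static String decode(String input) {
        byte[] in = input.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            if (b == '%' && i + 2 < in.length && isHex(in[i + 1]) && isHex(in[i + 2])) {
                // Valid escape: combine the two hex digits into one byte.
                out.write((Character.digit(in[i + 1], 16) << 4) | Character.digit(in[i + 2], 16));
                i += 2;
            } else {
                // Anything else, including a stray %, is copied through unchanged.
                out.write(b);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    private static boolean isHex(byte b) {
        return (b >= '0' && b <= '9') || (b >= 'A' && b <= 'F') || (b >= 'a' && b <= 'f');
    }
}
```

With this sketch, `decode("abc%u021B")` and `decode("abc%25u021B")` both yield the literal text `abc%u021B`.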

The %uxxxx encoding scheme was specified in draft-duerst-iri. If you wanted to implement it, pseudo-code would be:

  1. Match on %u([A-Fa-f0-9]{4}) (the hex digits may be upper or lower case, as in %u021B)
  2. Parse the hex digits into a byte array b
  3. Take new String(b, UTF_16BE).getBytes(UTF_8)
  4. Append each byte in that result as %xx
  5. Replace the original %uxxxx match
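The steps above could be sketched in Java roughly as follows. The class name `PercentUNormalizer` and its `normalize` method are hypothetical. One assumption goes beyond the pseudo-code: consecutive %uXXXX escapes are decoded together, so that a surrogate pair becomes a single code point before UTF-8 encoding.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PercentUNormalizer {
    // One or more consecutive %uXXXX escapes, so surrogate pairs are decoded together.
    private static final Pattern PERCENT_U = Pattern.compile("(?:%u[0-9A-Fa-f]{4})+");

    /** Rewrites every run of %uXXXX escapes as %XX escapes of its UTF-8 bytes. */
    public static String normalize(String input) {
        Matcher m = PERCENT_U.matcher(input);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(input, last, m.start());
            String run = m.group();
            // Each escape is six characters: "%u" plus four hex digits, one UTF-16 code unit.
            StringBuilder chars = new StringBuilder();
            for (int i = 0; i < run.length(); i += 6) {
                chars.append((char) Integer.parseInt(run.substring(i + 2, i + 6), 16));
            }
            // Re-encode the decoded characters as UTF-8 and percent-encode each byte.
            for (byte b : chars.toString().getBytes(StandardCharsets.UTF_8)) {
                out.append(String.format("%%%02X", b & 0xFF));
            }
            last = m.end();
        }
        out.append(input, last, input.length());
        return out.toString();
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI(normalize("http://host?xyz=abc%u021B"));
        System.out.println(uri);
    }
}
```

Because every output byte is percent-encoded, this avoids relying on a form encoder such as `URLEncoder`; the trade-off is that characters which would be legal unescaped are escaped as well.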
Joe

You could convert "%u021B" -> "\\u021B" and then unescape it to the Unicode character using `org.apache.commons.lang3.StringEscapeUtils`.

Example:

    import java.net.URI;
    import org.apache.commons.lang3.StringEscapeUtils;

    String str = "http://host?xyz=abc%u021B";

    // Turn each %uXXXX into a Java-style \uXXXX escape, then unescape it.
    str = str.replaceAll("%u", "\\\\u");
    str = StringEscapeUtils.unescapeJava(str);

    URI uri = new URI(str);
    System.out.println("It works!");
    System.out.println(str);
Roy
  • It's a good start. The only issue: %u021B needs to be replaced with the standard-encoded string. Right now it appears in its decoded form, as produced by `StringEscapeUtils.unescapeJava`. –  Jun 11 '20 at 14:37

Based on @Roy's answer, this code works:

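    import java.net.URI;
    import java.net.URISyntaxException;
    import java.net.URLEncoder;
    import java.nio.charset.StandardCharsets;
    import org.apache.commons.lang3.StringEscapeUtils;

    public static URI toUri(String uri) throws URISyntaxException {
        StringBuilder stringBuilder = new StringBuilder(uri);
        int index = stringBuilder.indexOf("%u");
        while (index > -1) {
            try {
                // Turn the six-character %uXXXX group into \uXXXX and unescape it.
                String substring = stringBuilder.substring(index, index + 6).replaceAll("%u", "\\\\u");
                // URLEncoder.encode(String, Charset) requires Java 10 or later.
                String encoded = URLEncoder.encode(StringEscapeUtils.unescapeJava(substring), StandardCharsets.UTF_8);
                stringBuilder.replace(index, index + 6, encoded);
                // Resume after the replacement, whose length may differ from 6.
                index = stringBuilder.indexOf("%u", index + encoded.length());
            } catch (Exception e) {
                throw new URISyntaxException(uri, e.getMessage());
            }
        }
        return new URI(stringBuilder.toString());
    }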

The idea is to replace every %uxxxx group with the percent-encoded UTF-8 form of the Unicode character \uxxxx.

This way http://host?xyz=abc%u021B becomes http://host?xyz=abc%C8%9B, which is a standard URI.

  • Note that `URLEncoder.encode` [isn't the right method in general](https://stackoverflow.com/questions/4737841/urlencoder-not-able-to-translate-space-character); [prefer Guava](https://stackoverflow.com/a/31595036/733345). – Joe Jun 11 '20 at 15:16
  • @Joe I know, but the only issue is with the space character. It doesn't bother me, since spaces arrive already standard-encoded. –  Jun 11 '20 at 15:19