java.net.URI non-standard encoding for Unicode character error

Question

The following code line:

    URI url = new URI("http://host?xyz=abc%u021B");

gives the error:

java.net.URISyntaxException: Malformed escape pair at index 19: http://host?xyz=abc%u021B

The reason is the presence of %u021B, a non-standard encoding for a Unicode character.
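
For reference, a small sketch that reproduces the failure and reports where parsing stops. The class and method names (MalformedEscapeDemo, failureIndex) are hypothetical and used only for illustration:

```java
import java.net.URI;
import java.net.URISyntaxException;

final class MalformedEscapeDemo {
    /** Returns the index at which java.net.URI rejects the input, or -1 if it parses. */
    static int failureIndex(String input) {
        try {
            new URI(input);
            return -1;
        } catch (URISyntaxException e) {
            // getIndex() is the position in the input where the error was detected.
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        String input = "http://host?xyz=abc%u021B";
        int index = failureIndex(input);
        // Show the reported index together with the character found there.
        System.out.println(index + " -> '" + input.charAt(index) + "'");
    }
}
```

The reported index points at the % sign, because the two characters after it are not both hex digits.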

Is there a standard way to handle this?

This isn't a URI, and `java.net.URI` won't work with it directly. What result are you after? — Joe, Jun 11 '20 at 12:15
By definition, I don't think you can have a standard solution for a non-standard issue. — Simon G., Jun 11 '20 at 14:01
Perhaps "not a URI" is too broad, but it's not a URI in the [RFC 3986 sense](https://tools.ietf.org/html/rfc3986#appendix-A). Either transform it into one, or process it without using standard libraries. — Joe, Jun 11 '20 at 14:44

score 1 · Answer 1 · edited Oct 07 '21 at 11:19

Is there a standard way to handle this?

Following RFC 3986, this isn't a valid URI, and the correct behaviour would be to reject it.

The WHATWG living standard suggests a more robust behaviour of treating the characters literally:

Otherwise, if byte is 0x25 (%) and the next two bytes after byte in input are not in the ranges 0x30 (0) to 0x39 (9), 0x41 (A) to 0x46 (F), and 0x61 (a) to 0x66 (f), all inclusive, append byte to output.

Here the byte after % is u, which is not a hex digit, so this rule applies and the % is appended as-is, meaning that:

%u021B

is treated the same as:

%25u021B
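
One way to get that literal-% behaviour with java.net.URI is to rewrite every % that is not followed by two hex digits as %25 before parsing. This is a minimal sketch, not something taken from the WHATWG text or the JDK; the names StrayPercentEscaper and escapeStrayPercents are hypothetical:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Pattern;

final class StrayPercentEscaper {
    // A '%' that is NOT followed by two hex digits, i.e. a malformed escape.
    private static final Pattern STRAY_PERCENT = Pattern.compile("%(?![0-9A-Fa-f]{2})");

    /** Rewrites each malformed '%' as "%25" so it is read as a literal percent sign. */
    static String escapeStrayPercents(String input) {
        return STRAY_PERCENT.matcher(input).replaceAll("%25");
    }

    public static void main(String[] args) throws URISyntaxException {
        URI uri = new URI(escapeStrayPercents("http://host?xyz=abc%u021B"));
        System.out.println(uri.getRawQuery()); // raw form keeps the escaped percent sign
        System.out.println(uri.getQuery());    // decoded form shows the literal text
    }
}
```

Well-formed escapes such as %20 are left untouched, so the rewrite only affects sequences that java.net.URI would otherwise reject.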

The %uxxxx encoding scheme was specified in draft-duerst-iri. If you wanted to implement it, pseudo-code would be:

Match on %u([0-9A-Fa-f]{4})
Parse the four hex digits into a two-byte array b (one big-endian UTF-16 code unit)
Take new String(b, UTF_16BE).getBytes(UTF_8)
Append each byte in that result as %XX
Replace the original %uxxxx match with that result
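
The steps above can be sketched in plain Java with no third-party libraries. The class and method names (PercentUConverter, toStandardEscapes) are made up for illustration. Decoding consecutive %uxxxx groups together, so that a surrogate pair becomes a single UTF-8 sequence, is an assumption that goes beyond the pseudo-code:

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PercentUConverter {
    // One or more consecutive %uXXXX groups. A run is decoded as a whole so that
    // a surrogate pair (two UTF-16 code units) maps to one UTF-8 sequence.
    private static final Pattern PERCENT_U_RUN = Pattern.compile("(?:%u[0-9A-Fa-f]{4})+");

    /** Rewrites %uXXXX groups as standard UTF-8 percent-escapes (%XX). */
    static String toStandardEscapes(String input) {
        Matcher m = PERCENT_U_RUN.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String run = m.group();
            // Each group is 6 chars ("%u" + 4 hex digits) and yields 2 bytes of UTF-16BE.
            byte[] utf16 = new byte[run.length() / 6 * 2];
            for (int i = 0, j = 0; i < run.length(); i += 6) {
                int unit = Integer.parseInt(run.substring(i + 2, i + 6), 16);
                utf16[j++] = (byte) (unit >> 8);
                utf16[j++] = (byte) unit;
            }
            byte[] utf8 = new String(utf16, StandardCharsets.UTF_16BE)
                    .getBytes(StandardCharsets.UTF_8);
            StringBuilder escaped = new StringBuilder();
            for (byte b : utf8) {
                escaped.append(String.format("%%%02X", b & 0xFF));
            }
            m.appendReplacement(out, Matcher.quoteReplacement(escaped.toString()));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String fixed = toStandardEscapes("http://host?xyz=abc%u021B");
        System.out.println(fixed);
        System.out.println(URI.create(fixed).getQuery());
    }
}
```

The result contains only standard %XX escapes, so it can be handed to new URI(...) directly.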

score 0 · Answer 2 · answered Jun 11 '20 at 14:31

You could convert "%u021B" to "\\u021B"
and then unescape it into the actual Unicode character using org.apache.commons.lang3.StringEscapeUtils.

Example:

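String str = "http://host?xyz=abc%u021B";

// Turn %uXXXX into the Java-style escape \uXXXX, then unescape it.
str = str.replace("%u", "\\u");
str = StringEscapeUtils.unescapeJava(str);

// java.net.URI accepts the raw non-ASCII character in the query.
URI uri = new URI(str);
System.out.println("It works!");
System.out.println(str);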

answered Jun 11 '20 at 14:31 by Roy

It's a good start. The only issue: %u021B needs to be replaced with the standard percent-encoded string. Right now it appears in decoded form, as produced by `StringEscapeUtils.unescapeJava` – Jun 11 '20 at 14:37

score 0 · Answer 3 · 2020-06-11T15:16:38.603

Based on @Roy's answer, this code works:

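import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import org.apache.commons.lang3.StringEscapeUtils;

// Replaces each %uXXXX group with its UTF-8 percent-encoding.
// URLEncoder.encode(String, Charset) needs Java 10 or later.
public static URI toUri(String uri) throws URISyntaxException {
    StringBuilder stringBuilder = new StringBuilder(uri);
    int index = stringBuilder.indexOf("%u");
    while (index > -1) {
        try {
            // Rewrite "%uXXXX" as the Java escape "\uXXXX" and unescape it to one character.
            String substring = stringBuilder.substring(index, index + 6).replace("%u", "\\u");
            String encoded = URLEncoder.encode(StringEscapeUtils.unescapeJava(substring), StandardCharsets.UTF_8);
            stringBuilder.replace(index, index + 6, encoded);
            // Resume after the inserted text, whose length is not always 6.
            index = stringBuilder.indexOf("%u", index + encoded.length());
        } catch (Exception e) {
            // The reason argument must not be null, so guard against a null message.
            throw new URISyntaxException(uri, String.valueOf(e.getMessage()));
        }
    }
    return new URI(stringBuilder.toString());
}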

The idea is to replace every %uxxxx group with the percent-encoded UTF-8 value of the Unicode character \uxxxx.

This way http://host?xyz=abc%u021B becomes http://host?xyz=abc%C8%9B, which is a standard URI.

Note that `URLEncoder.encode` [isn't the right method in general](https://stackoverflow.com/questions/4737841/urlencoder-not-able-to-translate-space-character); [prefer Guava](https://stackoverflow.com/a/31595036/733345). — Joe, Jun 11 '20 at 15:16
@Joe I know, but the only issue is with space character. It doesn't bother me since space comes already standard-encoded. — , Jun 11 '20 at 15:19
