Uri and WebView classes parsing URLs containing backslashes in authority (host or user information) differently

Question

When using the URIs

String myUri = "https://evil.example.com\\.good.example.org/";
// or
String myUri = "https://evil.example.com\\@good.example.org/";

in Java on Android, the backslash in the host or user information of the authority part of the URI causes a mismatch between how Android’s android.net.Uri and android.webkit.WebView parse the URI with regard to its host.

The Uri class (and cURL) treat evil.example.com\.good.example.org (first example) or even good.example.org (second example) as the URI’s host.
The WebView class (and Firefox and Chrome) treat evil.example.com (both examples) as the URI’s host.

Is this known, expected or correct behavior? Do the two classes simply follow different standards?

Looking at the specification, it seems neither RFC 2396 nor RFC 3986 allows for a backslash in the user information or authority.

Is there any workaround to ensure consistent behavior here, especially for validation purposes? Does the following patch look reasonable (to be used with WebView and for general correctness)?

Uri myParsedUri = Uri.parse(myUri);
String host = myParsedUri.getHost();
String userInfo = myParsedUri.getUserInfo();

// Accept the URI only if neither the host nor the user information contains a backslash
if ((host == null || !host.contains("\\")) && (userInfo == null || !userInfo.contains("\\"))) {
    // valid URI
}
else {
    // invalid URI
}

One possible flaw is that this workaround may not catch all the cases that cause inconsistent hosts to be parsed. Do you know of anything else (apart from a backslash) that causes a mismatch between the two classes?

David · Answer 1 · 2018-06-14T14:55:52.753

It is known that the WebView in Android 4.4 converts some URLs; the linked issue describes some steps to prevent that. From your question it is not completely clear whether your need stems from that issue or from something else.

You can mask backslashes and other characters with their corresponding number in the character table. In URLs, the number is written in hexadecimal.

Hexadecimal: 5C
Decimal: 92
Character: \

The code is then prepended with a % for each such character in the URL. Because the Java literal "\\" denotes a single backslash, your code looks like this after replacement:

String myUri = "https://evil.example.com%5C.good.example.org/";
// or
String myUri = "https://evil.example.com%5C@good.example.org/";

It might still be necessary to add a slash to separate the domain from the path:

String myUri = "https://evil.example.com/%5C.good.example.org/";
// or
String myUri = "https://evil.example.com/%5C@good.example.org/";
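The mapping described above (backslash, decimal 92, hexadecimal 5C) can be sketched in Java as follows. This is only an illustration: the class and method names are hypothetical, and the helper assumes single-byte ASCII characters. Note that the Java literal "\\" denotes one backslash, so each occurrence yields one %5C.

```java
// Hypothetical helper for illustration; not part of any Android or JDK API.
final class PercentEncodeSketch {

    // Percent-encodes a single ASCII character, e.g. '\' (decimal 92) becomes "%5C".
    // Assumption: c is below 0x80, so it maps to exactly one byte.
    static String percentEncode(char c) {
        return String.format("%%%02X", (int) c);
    }

    // Replaces every literal backslash in the input with its percent-encoded form.
    static String encodeBackslashes(String s) {
        return s.replace("\\", percentEncode('\\'));
    }
}
```

For example, encodeBackslashes applied to the first URI from the question turns the one backslash after evil.example.com into %5C and leaves the rest untouched.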

Is it possible that the backslashes are never meant to be used for network communication at all, but only serve as escaping for some processing steps, such as regular expressions or output in JavaScript (JSON)?

Bonus ;-)
Below is a PHP script that prints table rows for most Unicode characters (code points 0000 to FFFF) with the corresponding numbers in hex and dec. (It should still be wrapped in an HTML template, perhaps including CSS.)

<?php
    $chs = array('0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F');
    $chs2 = $chs;
    $chs3 = $chs;
    $chs4 = $chs;
    foreach ($chs as $ch){
        foreach ($chs2 as $ch2){    
            foreach ($chs3 as $ch3){
                foreach ($chs4 as $ch4){
                    echo '<tr>';
                    echo '<td>';
                    echo $ch.$ch2.$ch3.$ch4;
                    echo '</td>';
                    echo '<td>';
                    echo hexdec($ch.$ch2.$ch3.$ch4);
                    echo '</td>';
                    echo '<td>';
                    echo '&#x'.$ch.$ch2.$ch3.$ch4.';';
                    echo '</td>';
                    echo '</tr>';
                }
            }
        }
    }
?>

Thanks, but the question was not at all how to properly encode the backslash (which simply requires the well-known percent-encoding), but how to fix the validation of URLs containing the backslash in Java on Android. Further, I did *not* say that some RFCs allow for the backslash to be used literally. Instead, I said that *neither* of the two relevant RFCs allow that. — caw, Jun 14 '18 at 13:10
Sorry, I did not read your statement about the RFCs carefully enough. I corrected that part. As for the rest, I think it would really depend on a special server setup. Maybe you can still outline the special use case; it is not clear to me why this should be desired or required at all. — David, Jun 14 '18 at 13:48
It’s not about those changes in Android 4.4. This is irrelevant to this issue. The same behavior can be reproduced on Android 5 or later. The question is not about any special use case, either, but about a difference in behavior between the `WebView` class (and thus the Chrome implementation backing those `WebView` instances) and the `Uri` class. That’s all. If you need a use case: An attacker could supply malicious URLs to you, e.g. using an `Intent` if you allow for that, or via crafted URLs if your app is set to open certain URL patterns. You may want to determine the host of the URL here. — caw, Jun 14 '18 at 15:37
The question is what those URLs are needed for inside the classes, or why they are built like that. So my question was more about the internal structure of the classes, or about a special server that allows backslashes for a special reason (e.g. Google Play). Also, the only use case for combining two URLs is some web services such as searches (e.g. whois), a referrer passed as a parameter, or similar. But then masking is used, or only the bare domain without any slashes. As I use neither android.net nor the Android WebView, the whole problem may seem a bit strange to me. — David, Jun 14 '18 at 15:59

Bertram Gilfoyle · Answer 2 · 2018-06-17T07:33:59.173

Is this known, expected or correct behavior?

IMO, it is not, for either URI or WebView. Because the RFC does not allow a backslash, they could have warned about it. However, this matters less, because it does not affect operation at all if the input is as expected.

Do the two classes simply follow different standards?

The URI class and WebView follow the same standards. But because they are different implementations, they may behave differently on unexpected input.

For example, "^(([^:/?#]+):)?((//([^/?#]*))?([^?#]*)(\\?([^#]*))?)?(#(.*))?" is the regular expression used in URI to parse URIs. The URI parsing of WebView is done by native C++ methods. Even though they follow the same standards, there is a chance that they produce different outcomes (at least for unexpected inputs).
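To illustrate why a regex-based parser keeps the backslash inside the authority, here is a minimal sketch that applies the expression quoted above with java.util.regex. The class and method names are hypothetical, and the sketch makes no claim about how android.net.Uri or WebView are actually implemented. In this expression, capture group 5 is the authority, and its character class [^/?#] does not exclude a backslash.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only.
final class AuthorityRegexSketch {

    // The expression quoted in the answer; group 5 captures the authority.
    private static final Pattern URI_PATTERN = Pattern.compile(
            "^(([^:/?#]+):)?((//([^/?#]*))?([^?#]*)(\\?([^#]*))?)?(#(.*))?");

    // Returns the authority component as seen by the regex, or null if there is none.
    static String authorityOf(String uri) {
        Matcher m = URI_PATTERN.matcher(uri);
        return m.matches() ? m.group(5) : null;
    }
}
```

Applied to the first URI from the question, authorityOf returns the whole string evil.example.com\.good.example.org, backslash included, because only /, ? and # end the authority in this expression.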

Does the following patch look reasonable?

Not really (see the answer to the next question).

Do you know of anything else (apart from a backslash) that causes a mismatch between the two classes?

Because you are so concerned about consistent behavior, I won't suggest manual validation. Even the programmers who wrote these classes can't list all such scenarios.

The solution

If I understand correctly, you need to load URLs that are supplied by untrusted external sources (which attackers can exploit if there is a loophole), and you need to identify their host correctly.

In that case, you can parse it using URI class itself and use URI#getHost() to identify the host. But for WebView, instead of passing the original URL string, pass URI#toString().
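As a hedged sketch of this strict-parsing idea, assume that "URI" here means java.net.URI (not android.net.Uri) and that rejecting malformed input is acceptable. The java.net.URI constructor throws URISyntaxException for a backslash in the authority, so the hypothetical helper below returns a host only for URLs that parse strictly. The class and method names are illustrative.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper for illustration; not part of any Android or JDK API.
final class StrictHostCheck {

    // Returns the host if the URL parses strictly, or null if the URL is
    // malformed or has no server-based host.
    static String hostOrNull(String url) {
        try {
            return new URI(url).getHost();
        } catch (URISyntaxException e) {
            // Thrown, for example, for a backslash in the host or user information.
            return null;
        }
    }
}
```

A caller would hand the URL to WebView only when hostOrNull returns a non-null, expected host. Whether WebView then resolves the same host for every other kind of unusual input is not something this sketch can guarantee.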

Thanks! I would disagree and say that a backslash is very much *expected* input – more specifically, it’s expected and invalid. Both classes should validate and fail. Obviously, neither of us can change anything about that. For `WebView`, they just seem to have chosen to do what most standalone browsers do, i.e. rewriting the input instead of failing on validation. For `Uri`, they seem to do this for performance reasons, as the documentation says. So one can understand why they chose not to validate, but there should be a way for the developer to validate then. — caw, Jun 18 '18 at 23:51
Parsing a URL and calling `getHost` on that instance is what I have always done. This is the entire point of this question: The result of that `getHost` method is not consistent with what `WebView` does, so it’s useless and something else is needed. By the way, you seem to have conflated `android.net.Uri` and `java.net.URI`. The former is what this question is about. The latter is what you may be referring to when consistently writing `URI`. — caw, Jun 18 '18 at 23:54
Your solution of using `Uri.parse(myUri).toString()` or `new URI(myUri).toString()` instead of simply `myUri` and passing that to `WebView` does not work at all. The former just returns the same value and is therefore useless. The latter does not return anything but fails with `java.net.URISyntaxException: Illegal character in authority` – which is good, but not what your solution stated, and neither the class that this question has been about. — caw, Jun 18 '18 at 23:58
