How to parse URLs with leading unprintable control characters?

Question

I'm looking at a couple of URL parsers in Perl (Mojo::URL and URI) in the context of HTML::Restrict. The problem that I want to solve is that I'd like to be able to strip URLs in some cases. For instance, when filtering HTML I might want to allow relative URLs, but disallow JavaScript.

I've been presented with the following problem:

#!/usr/bin/env perl

use strict;
use warnings;
use feature qw( say );

use Mojo::URL ();

my $js_url = 'javascript:alert(1);';

my $mojo = Mojo::URL->new($js_url);
say 'scheme: ' . $mojo->scheme . " in $js_url";

for my $i ( 1 .. 8, 14 .. 31 ) {
    my $bad_url = "&#$i;" . $js_url;
    my $mojo    = Mojo::URL->new($bad_url);
    say $mojo->scheme ? 'scheme is ' . $mojo->scheme : 'no scheme found in ' . $bad_url;
}

This yields the following output:

scheme: javascript in javascript:alert(1);
no scheme found in &#1;javascript:alert(1);
no scheme found in &#2;javascript:alert(1);
no scheme found in &#3;javascript:alert(1);
no scheme found in &#4;javascript:alert(1);
no scheme found in &#5;javascript:alert(1);
no scheme found in &#6;javascript:alert(1);
no scheme found in &#7;javascript:alert(1);
no scheme found in &#8;javascript:alert(1);
no scheme found in &#14;javascript:alert(1);
no scheme found in &#15;javascript:alert(1);
no scheme found in &#16;javascript:alert(1);
no scheme found in &#17;javascript:alert(1);
no scheme found in &#18;javascript:alert(1);
no scheme found in &#19;javascript:alert(1);
no scheme found in &#20;javascript:alert(1);
no scheme found in &#21;javascript:alert(1);
no scheme found in &#22;javascript:alert(1);
no scheme found in &#23;javascript:alert(1);
no scheme found in &#24;javascript:alert(1);
no scheme found in &#25;javascript:alert(1);
no scheme found in &#26;javascript:alert(1);
no scheme found in &#27;javascript:alert(1);
no scheme found in &#28;javascript:alert(1);
no scheme found in &#29;javascript:alert(1);
no scheme found in &#30;javascript:alert(1);
no scheme found in &#31;javascript:alert(1);

In the above URLs, where no scheme is found, I'm left to assume that it's a relative URL. However, if I use the above URLs in href attributes, Chrome, Firefox and Safari all pop up a JavaScript alert box when the link is clicked:

<a href="&#1;javascript:alert(1);">1</a>
<a href="&#2;javascript:alert(1);">2</a>
<a href="&#3;javascript:alert(1);">3</a>
<a href="&#4;javascript:alert(1);">4</a>
<a href="&#5;javascript:alert(1);">5</a>
<a href="&#6;javascript:alert(1);">6</a>
<a href="&#7;javascript:alert(1);">7</a>
<a href="&#8;javascript:alert(1);">8</a>
<a href="&#14;javascript:alert(1);">14</a>
<a href="&#15;javascript:alert(1);">15</a>
<a href="&#16;javascript:alert(1);">16</a>
<a href="&#17;javascript:alert(1);">17</a>
<a href="&#18;javascript:alert(1);">18</a>
<a href="&#19;javascript:alert(1);">19</a>
<a href="&#20;javascript:alert(1);">20</a>
<a href="&#21;javascript:alert(1);">21</a>
<a href="&#22;javascript:alert(1);">22</a>
<a href="&#23;javascript:alert(1);">23</a>
<a href="&#24;javascript:alert(1);">24</a>
<a href="&#25;javascript:alert(1);">25</a>
<a href="&#26;javascript:alert(1);">26</a>
<a href="&#27;javascript:alert(1);">27</a>
<a href="&#28;javascript:alert(1);">28</a>
<a href="&#29;javascript:alert(1);">29</a>
<a href="&#30;javascript:alert(1);">30</a>
<a href="&#31;javascript:alert(1);">31</a>

I've used Mojo::URL in the examples, but URI has the same behaviour. What I gather is that in both cases the parser does not strip the unprintable control character and, therefore, does not recognize that there is JavaScript in the URL. Web browsers (helpfully?) recognize that the control characters are not printable and allow the JavaScript in the URL to be executed on click.

What's going on here? Are the parsers and browsers both behaving correctly? Is it up to me to strip away unprintable control characters before parsing the URL?

At least https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid?rq=1 suggests that these are "invalid" URLs, and it seems that User Agents clean them up, unfortunately. For sanitizing/stripping HTML, I guess you should follow what the UAs implement and not what the spec says... — Corion, Feb 05 '19 at 15:26
Apart from that, you are mixing up a plain URL with a URL encoded within an HTML context. The characters you add (e.g. `&#1;`) are not unprintable, but they get interpreted as HTML encoded (i.e. as one character `\x01`) inside a href since this is HTML context. They are taken verbatim (`&#1;` - 4 characters) if you just use these without explicit HTML decoding in Perl. — Steffen Ullrich, Feb 05 '19 at 16:49
@SteffenUllrich the main issue is that HTML::Restrict should be able to deal with any content (malicious or otherwise) provided by the user. In this case the HTML encoded content is able to defeat HTML::Restrict because URI doesn't find a scheme. What I'm after is the correct way and the correct place to deal with this. Maybe stripping HTML encoded chars from the URL before parsing it? https://github.com/oalders/html-restrict/issues/30 is the issue that started this. — oalders, Feb 05 '19 at 19:37
@SteffenUllrich I say unprintable because the range of problematic chars are ASCII control characters. Taken verbatim, yes, they are printable by Perl but the browser won't render them. — oalders, Feb 05 '19 at 19:40
@oalders: the browser will interpret HTML entities only inside the HTML context. If you put '' into a ` — Steffen Ullrich, Feb 05 '19 at 19:47
@oalders: as for HTML::Restrict: if in HTML context it should check the HTML decoded value for a valid syntax, i.e. decode the contents of the `href` attribute before checking. `\x01` (the HTML decoded `&#1;`) is not part of a valid URL (no matter if absolute or relative) and should thus be blocked. — Steffen Ullrich, Feb 06 '19 at 19:50
@SteffenUllrich would it perhaps be the job of the parser to strip the control chars? I'm looking at https://url.spec.whatwg.org/ and it says a parser should "remove any leading and trailing C0 control or space from input". I've opened a ticket for that here: https://github.com/libwww-perl/URI/issues/61 — oalders, Feb 05 '19 at 21:01
@oalders: The spec says before that *"If input contains any leading or trailing C0 control or space, validation error."*. Based on the context HTML::Restrict is used in I would argue that it should not rely on URI to return sanitized data but should treat such input as invalid and remove it. — Steffen Ullrich, Feb 06 '19 at 04:05
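
The approach suggested in the comments above (decode the `href` value first, then treat control characters as invalid rather than relying on the URL parser to clean them up) could be sketched roughly as follows. This is only a minimal illustration using core Perl: `decode_numeric_refs` and `href_scheme` are hypothetical helper names, not part of HTML::Restrict, URI or Mojo::URL, and the hand-rolled decoder only covers numeric character references. A real filter would more likely use a full entity decoder such as `decode_entities` from HTML::Entities, which also handles named entities. Rejecting control characters anywhere in the value (not just at the start or end) is an assumption that is stricter than the WHATWG wording quoted above.

```perl
#!/usr/bin/env perl

use strict;
use warnings;
use feature qw( say );

# Hypothetical helper: decode numeric character references only
# (e.g. "&#1;" or "&#x1F;"). Named entities are NOT handled here.
sub decode_numeric_refs {
    my ($text) = @_;
    $text =~ s/&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));?/chr( defined $1 ? hex($1) : $2 )/ge;
    return $text;
}

# Hypothetical helper: returns the lower-cased scheme, '' if the value
# looks like a relative URL, or undef if the value should be rejected.
sub href_scheme {
    my ($raw_attr) = @_;

    # Look at what a browser would see after HTML decoding the attribute.
    my $url = decode_numeric_refs($raw_attr);

    # Assumption: reject C0 controls or DEL anywhere in the value, plus
    # leading or trailing spaces, instead of silently stripping them.
    return undef if $url =~ /[\x00-\x1F\x7F]/ || $url =~ /\A\x20|\x20\z/;

    # Scheme syntax as in RFC 3986: a letter, then letters, digits, "+", "-", "."
    return lc $1 if $url =~ /\A([A-Za-z][A-Za-z0-9+.\-]*):/;

    return '';
}

for my $href ( 'javascript:alert(1);', '&#1;javascript:alert(1);', '/relative/path' ) {
    my $scheme = href_scheme($href);
    my $verdict
        = !defined $scheme ? 'reject'
        : $scheme eq ''    ? 'relative'
        :                    "scheme ${scheme}";
    say "$verdict for $href";
}
```

With a check along these lines, a caller could strip the attribute whenever the result is undefined or is a scheme that is not on an allow list, and keep relative URLs only when the decoded value is clean.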
