I'm looking at a couple of URL parsers in Perl (Mojo::URL and URI) in the context of HTML::Restrict. The problem that I want to solve is that I'd like to be able to strip URLs in some cases. For instance, when filtering HTML I might want to allow relative URLs, but disallow JavaScript.
I've been presented with the following problem:
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw( say );
use Mojo::URL ();
my $js_url = 'javascript:alert(1);';
my $mojo = Mojo::URL->new($js_url);
say 'scheme: ' . $mojo->scheme . " in $js_url";
for my $i ( 1 .. 8, 14 .. 31 ) {
    my $bad_url = "&#$i;" . $js_url;
    my $mojo    = Mojo::URL->new($bad_url);
    say $mojo->scheme
        ? 'scheme is ' . $mojo->scheme
        : 'no scheme found in ' . $bad_url;
}
This yields the following output:
scheme: javascript in javascript:alert(1);
no scheme found in &#1;javascript:alert(1);
no scheme found in &#2;javascript:alert(1);
no scheme found in &#3;javascript:alert(1);
no scheme found in &#4;javascript:alert(1);
no scheme found in &#5;javascript:alert(1);
no scheme found in &#6;javascript:alert(1);
no scheme found in &#7;javascript:alert(1);
no scheme found in &#8;javascript:alert(1);
no scheme found in &#14;javascript:alert(1);
no scheme found in &#15;javascript:alert(1);
no scheme found in &#16;javascript:alert(1);
no scheme found in &#17;javascript:alert(1);
no scheme found in &#18;javascript:alert(1);
no scheme found in &#19;javascript:alert(1);
no scheme found in &#20;javascript:alert(1);
no scheme found in &#21;javascript:alert(1);
no scheme found in &#22;javascript:alert(1);
no scheme found in &#23;javascript:alert(1);
no scheme found in &#24;javascript:alert(1);
no scheme found in &#25;javascript:alert(1);
no scheme found in &#26;javascript:alert(1);
no scheme found in &#27;javascript:alert(1);
no scheme found in &#28;javascript:alert(1);
no scheme found in &#29;javascript:alert(1);
no scheme found in &#30;javascript:alert(1);
no scheme found in &#31;javascript:alert(1);
For the URLs above where no scheme is found, I'm left to assume that each one is a relative URL. However, if I use those same URLs in href attributes, Chrome, Firefox and Safari all pop up a JavaScript alert box when the link is clicked:
<a href="&#1;javascript:alert(1);">1</a>
<a href="&#2;javascript:alert(1);">2</a>
<a href="&#3;javascript:alert(1);">3</a>
<a href="&#4;javascript:alert(1);">4</a>
<a href="&#5;javascript:alert(1);">5</a>
<a href="&#6;javascript:alert(1);">6</a>
<a href="&#7;javascript:alert(1);">7</a>
<a href="&#8;javascript:alert(1);">8</a>
<a href="&#14;javascript:alert(1);">14</a>
<a href="&#15;javascript:alert(1);">15</a>
<a href="&#16;javascript:alert(1);">16</a>
<a href="&#17;javascript:alert(1);">17</a>
<a href="&#18;javascript:alert(1);">18</a>
<a href="&#19;javascript:alert(1);">19</a>
<a href="&#20;javascript:alert(1);">20</a>
<a href="&#21;javascript:alert(1);">21</a>
<a href="&#22;javascript:alert(1);">22</a>
<a href="&#23;javascript:alert(1);">23</a>
<a href="&#24;javascript:alert(1);">24</a>
<a href="&#25;javascript:alert(1);">25</a>
<a href="&#26;javascript:alert(1);">26</a>
<a href="&#27;javascript:alert(1);">27</a>
<a href="&#28;javascript:alert(1);">28</a>
<a href="&#29;javascript:alert(1);">29</a>
<a href="&#30;javascript:alert(1);">30</a>
<a href="&#31;javascript:alert(1);">31</a>
I've used Mojo::URL in the examples, but URI has the same behaviour. What I gather is that in both cases the parser does not strip the unprintable control character and, therefore, does not recognize that there is JavaScript in the URL. Web browsers (helpfully?) recognize that the control characters are not printable and allow the JavaScript in the URL to be executed on click.
What's going on here? Are the parsers and browsers both behaving correctly? Is it up to me to strip away unprintable control characters before parsing the URL?
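If the cleanup does turn out to be the caller's job, a minimal sketch of it might look like the following. It is based on my reading of the WHATWG URL Standard, which has the parser strip leading and trailing C0 control characters and spaces and remove tabs and newlines before it looks for a scheme, and on the fact that the browser decodes character references in the attribute value before the URL parser ever sees it. The helpers decode_numeric_entities, normalize_url and url_scheme are hypothetical names of my own, not part of Mojo::URL or URI, and the entity decoding deliberately covers numeric references only.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature qw( say );

# Hypothetical helper: decode numeric character references only
# (&#NN; and &#xHH;, with or without the trailing semicolon).
# Named references are NOT handled; a real filter would want a
# full HTML entity decoder here.
sub decode_numeric_entities {
    my ($text) = @_;
    $text =~ s/&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));?/
        defined $1 ? chr( hex $1 ) : chr($2)/gixe;
    return $text;
}

# Hypothetical helper: approximate what a browser does before parsing.
# Assumption: leading and trailing C0 controls and spaces are stripped,
# and tabs and newlines are removed wherever they appear.
sub normalize_url {
    my ($raw) = @_;
    my $url = decode_numeric_entities($raw);
    $url =~ s/\A[\x00-\x20]+//;
    $url =~ s/[\x00-\x20]+\z//;
    $url =~ tr/\t\n\r//d;
    return $url;
}

# Return the lowercased scheme of the normalized URL, or undef if the
# URL has no scheme (that is, it looks relative).
sub url_scheme {
    my ($raw) = @_;
    my $url = normalize_url($raw);
    return $url =~ /\A([a-z][a-z0-9+.\-]*):/i ? lc $1 : undef;
}

# Same inputs as the Mojo::URL example above.
for my $i ( 1 .. 8, 14 .. 31 ) {
    my $bad_url = "&#$i;javascript:alert(1);";
    say "$bad_url => ", url_scheme($bad_url) // 'no scheme';
}
```

With this in place every one of the 26 test URLs reports javascript as its scheme, while something like /foo/bar.html still reports no scheme. The output of normalize_url could presumably be handed to Mojo::URL or URI instead of using the regex, though I have not checked how either module treats the remaining edge cases.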