Mitigate xss attacks when building links

Question

I posted this question a while back and it is working great for finding and 'linkifying' links from user generated posts. Linkify Regex Function PHP Daring Fireball Method

   <?php
if (!function_exists("html")) {
function html($string){
    return htmlspecialchars($string, ENT_QUOTES, 'UTF-8');
}
}

if ( false === function_exists('linkify') ):   
  function linkify($str) {
$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
    $input = $matches[0];
    $url = $matches[2] == 'http' ? $input : "http://$input";
    return '<a href="' . $url . '" rel="nofollow" target="_blank">' . "$input</a>";
}, $str); 
}
endif;

echo "<div>" . linkify(html($row_rsgetpost['userinput'])) . "</div>";

?>

I am concerned that I may be introducing a security risk by inserting user generated content into a link. I am already escaping user content coming from my database with htmlspecialchars($string, ENT_QUOTES, 'UTF-8') before running it through the linkify function and echoing back to the page, but I've read on OWASP that link attributes need to be treated specially to mitigate XSS. I am thinking this function is ok since it places the user-generated content inside double quotes and has already been escaped with htmlspecialchars($string, ENT_QUOTES, 'UTF-8'), but would really appreciate someone with xss expertise to confirm this. Thanks!

If untrusted input is meant to be placed into href, src or other URL-based attributes, it should be validated to make sure it does not point to an unexpected protocol, especially Javascript links. URL's should then be encoded based on the context of display like any other piece of data. For example, user driven URL's in HREF links should be attribute encoded. Example given is in Java. Not sure how to implement in PHP... http://code.google.com/p/owasp-esapi-java/source/browse/trunk/src/main/java/org/owasp/esapi/codecs/PercentCodec.java — Jeff, Apr 25 '12 at 16:19
@Jeff, your code is actually quite insecure, because htmlspecialchars will not remove any tags. So somebody could add ``. It may be hard to do anything within that script, because no quotes are allowed, but I would not rely on that. Use `htmlentities()` instead. — d_inevitable, Apr 26 '12 at 18:13
Ok, now I'm thoroughly confused. I was under the impression that `htmlspecialchars` was just as effective as `htmlentities` in neutralizing 'dangerous' characters like < > etc. yet doesn't encode other harmless characters into their entities. — Jeff, Apr 26 '12 at 18:29
http://stackoverflow.com/questions/46483/htmlentities-vs-htmlspecialchars — Jeff, Apr 26 '12 at 18:31
@Jeff ok, sorry. I thought that `htmlspecialchars` doesn't escape `<>`. But I was wrong. `htmlspecialchars` is indeed a better choice. — d_inevitable, Apr 26 '12 at 18:36
Ok, great. Thanks for confirming. Hopefully we can get @Rook 's buy in on this and put this to rest! :-) — Jeff, Apr 26 '12 at 18:46
Jeff although @Rook really does some very experienced in security aspects and most likely knows much more than i do in that subject, he is so full of it that I think he may not realize that he is missing something here. In a way I understand him though, there are many cases on SO where answers with severe security vulnerabilities get accepted over much better answers and that can be very frustrating. He is only pulling out his anger at the wrong post imo. — d_inevitable, Apr 26 '12 at 21:24
Thanks for your help and patience with a novice @d_inevitable. I accepted your answer based on the discussion and explanation around the issue. Just for clarity for someone else looking at this solution, I am using the code now in my question which is based on your original code, plus today's suggestions-- but somewhat different than the code in your answer. — Jeff, Apr 26 '12 at 21:36

score 1 · Answer 1 · edited May 23 '17 at 12:12

First of data must NEVER be escaped before entering the database, this is very serious mistake. This is not only insecure, but it breaks functionality. Chaining the values of strings, is data corruption and affects string comparison. This approach is insecure because XSS is an output problem. When you are inserting data into the database you do not know where it appears on the page. For instance, even if you where this function the following code is still vulnerable to XSS:

For example:

<a href="javascript:alert(1)" \>

In terms of your regular expression. My initial reaction was, well this is a horrible idea. No comments on how its supposed to work and heavy use of NOT operators, blacklisting is always worse than white-listing.

So I loaded up Regex Buddy and in about 3 minutes I bypassed your regex with this input:

https://test.com/test'onclick='alert(1);//

No developer wants to write a vulnerably, so they are caused with a breakdown in how programmer thinks his application is working, and how it actually works. In this case i would assume you never tested this regex, and its a gross oversimplification of the problem.

HTMLPurifer is a php library designed to clean HTML, it consist of THOUSANDS of regular expressions. Its very slow, and is bypassed on a fairly regular basis. So if you go this route, make sure to update regularly.

In terms of fixing this flaw i think your best off using htmlspecialchars($string, ENT_QUOTES, 'UTF-8'), and then enforcing that the string start with 'http'. HTML encoding is a form of escaping, and the value will be automatically decoded such that the URL is unmolested.

score 1 · Answer 2 · answered Apr 26 '12 at 05:52

1

Because the data is going into an attribute, it should be url (or percent) encoded:

return '<a href="' . urlencode($url) . '" rel="nofollow" target="_blank">' . "$input</a>";

Technically it should also then be html encoded

return '<a href="' . htmlspecialchars(urlencode($url)) . '" rel="nofollow" target="_blank">' . "$input</a>";

but no browsers I know of care and consequently no-one does it, and it sounds like you might be doing this step already and you don't want to do this twice

answered Apr 26 '12 at 05:52

tobyodavies

27,347
5
42
57

1

Neither of these approaches produces valid http links. This approach is self defeating. – rook Apr 26 '12 at 06:05
Thanks @tobyodavies. I think you're on the right track, and I tried that earlier. Unfortunately it breaks the 'linkify' function because the urlencoded url no longer matches with the Regex that determines the link in the preg_replace_callback function. Unfortunately, it is a bit over my head... not sure if it's possible. – Jeff Apr 26 '12 at 06:09
@Rook what possible invalid URL can escape URL encoding? it may end up nonsensical, but it will always be legal and can never be a JS link or have any side effects on the DOM) – tobyodavies Apr 26 '12 at 07:19

score 0 · Accepted Answer · edited Apr 29 '12 at 23:12

0

Your regular expression is looking for urls that are of http or https. That expression seems to be relatively safe as in does not detect anything that is not a url.

The XSS vulnerability comes from the escaping of the url as html argument. That means making sure that the url cannot prematurely escape the url string and then add extra attributes to the html tag that @Rook has been mentioning.

So I cannot really think of a way how an XSS attack could be performed the following code as suggested by @tobyodavies, but without urlencode, which does something else:

$pattern = '(?xi)\b((?:(http)s?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))';
return preg_replace_callback("#$pattern#i", function($matches) {
    $input = $matches[0];
    $url = $matches[2] == 'http' ? $input : "http://$input";
    return '<a href="' . htmlspecialchars($url) . '" rel="nofollow" target="_blank">' . "$input</a>";
}, $str);

Note that I have also a added a small shortcut for checking the http prefix.

Now the anchor links that you generate are safe.

However you should also sanitize the rest of the text. I suppose that you don't want to allow any html at all and display all the html as clear text.

edited Apr 29 '12 at 23:12

rook

66,304
38
162
239

answered Apr 26 '12 at 13:55

d_inevitable

4,381
2
29
48

Thanks @d_inevitable. I was hoping you would take a look at this since the original linkify function is yours. Unfortunately when I try the above (after fixing typo on $mathes --> $matches) the linkify function doesn't work anymore. When I click on entered text of 'www.google.com' for example, the link goes to https://www.mysite.com/directory/http%3A%2F%2Fwww.google.com. Is this because the original $pattern regex is no longer able to match against the urlencoded characters? Do you think the original 'linkify' function is sufficient since the user-generated content is inside the double quotes? – Jeff Apr 26 '12 at 15:13
Yes the original is sufficient, but less efficient. Did you also change the pattern like i did? Brackets around http: `(http)`. – d_inevitable Apr 26 '12 at 15:16
Actually I found the issue. It's with url encode that shouldn't take the protocol. Will put in a edit to make it work. – d_inevitable Apr 26 '12 at 15:20
Yes, I tried both the old and new, but unfortunately both broke the linkify function when I tried using urlencode. – Jeff Apr 26 '12 at 15:21
1

Okay, `urlencode` is not relevant. I was miss-lead into using it, but it is only encode url query string parameters, which is not your intention. So it's safe to take it out. The htmlspecialchars, will do the job. – d_inevitable Apr 26 '12 at 15:37

score 0 · Answer 4 · answered Jan 05 '14 at 23:38

Firstly, as the PHP documentation states htmlspecialchars only escapes " '&' (ampersand) becomes '&' '"' (double quote) becomes '"' when ENT_NOQUOTES is not set. "'" (single quote) becomes ''' (or ') only when ENT_QUOTES is set. '<' (less than) becomes '<' '>' (greater than) becomes '>' ". javascript: is still used in regular programming, so why : isn't escaped is beyond me.

Secondly, if !html only expects the characters you think will be entered, not the representation of those characters that can be entered and are seen as valid. the utf-8 character set, and every other character set supports multiple representations for the same character. Also, your false statement allows 0-9 and a-z, so you still have to worry about base64 characters. I'd call your code a good attempt, but it needs a ton of refining. That or you could just use htmlpurifier, which people can still bypass. I do think that it is awesome that you set the character set in htmlspecialchars, since most programmers don't understand why they should do that.

Mitigate xss attacks when building links

4 Answers4

Linked