24

I already know how XSS works, but finding out all the many different ways to inject malicious input is not an option.

I saw a couple of libraries out there, but most of them are very incomplete, inefficient, or GPL licensed (when will you guys learn that GPL is not a good fit for sharing little libraries? Use MIT).

HappyDeveloper
  • 6
    Maybe you could list the libraries you've already considered, so we don't waste our time with all of those incomplete, inefficient, or improperly licensed solutions? – grossvogel Oct 20 '10 at 02:08
  • 2
    Using a library will fix many XSS problems. If your application is complex, it won't get them all. If your application is a worthwhile target, someone will eventually break it. You ABSOLUTELY MUST learn how XSS works and understand it in great detail in order to write a secure application. Even if you use a library. – Paul McMillan Oct 20 '10 at 03:03
  • 11
    All “anti-XSS” libraries are incomplete by nature, as they are trying to apply heuristics to work out what input might be harmful when handled incorrectly at the output stage, but running at the input stage with no idea of what the output stage entails. Whilst there is a trade-off between how obvious an exploit you let through (false negative) and how badly you mangle real user input (false positive), you will always have ‘false’ because the task is inherently impossible. Anti-XSS is utterly bogus. You must fix your output to encode as necessary for the context. – bobince Oct 20 '10 at 03:08
  • +1 for bobince's comment – alex Oct 20 '10 at 04:05
  • Library? I think you mean function call. – rook Oct 20 '10 at 04:28
  • This is an old thread, but I would like to show a good reference for output encoding for other people searching for solutions: https://www.owasp.org/index.php/XSS_%28Cross_Site_Scripting%29_Prevention_Cheat_Sheet This webpage shows that you must use different encoding methods according to context. HTML, HTML attribute, CSS or Javascript, each of those situations demand a different kind of escaping. – pedromanoel Sep 22 '14 at 14:33

7 Answers

19

OWASP offers an encoding library that has had real effort put into handling the various output cases.

Obsolete: http://www.owasp.org/index.php/Category:OWASP_Encoding_Project

Now at http://code.google.com/p/reform/
and OWASP's antiXSS specific library is at: http://code.google.com/p/php-antixss/

atk
7

htmlspecialchars() is the only function you should know about.

zerkms
  • 16
    Unfortunately, that's not enough. If you HTML-encode characters used in JavaScript, you'll have bad data in your JS. Same for characters placed in URLs. Also, there are use cases where the function won't prevent XSS, such as tag attributes that aren't wrapped in single or double quotes (since whitespace is not encoded by htmlspecialchars) – atk Oct 20 '10 at 02:18
  • 1
    @zerkms: IIRC, JS requires \xx where xx is the hex code of the byte. URLs require %xx, again where xx is hex. A good JS example of badly encoded data would be alert("&amp;"), and a good example for URLs would be www.example.com/foo?a=b&amp;c=d. In the first case, if you want to alert the ampersand char, you'll alert the string &amp; instead. In the second, you'll have CGI args of a=b and amp;c=d (assuming ; isn't treated as a special char in the URL scheme; I don't remember off the top of my head whether it is). True, you won't have XSS, but your functionality won't work, either. – atk Oct 20 '10 at 02:25
  • 7
    Sure, you need the right form of encoding for your output context. That's most often `htmlspecialchars()` for HTML, but could be `rawurlencode()`, `json_encode()`, `mysql_real_escape_string()`, whatever. The main point is, this depends on the output stage and is *not* something that can be handled on the input using “anti-XSS” measures. – bobince Oct 20 '10 at 03:05
  • You need ENT_QUOTES when encoding something that is going in an attribute. Without it, the attacker can enter `string" onclick="[bad code]` (note careful positioning of quotes) and cause bad script to be run when someone tries to edit the input. – rjmunro Apr 13 '11 at 10:40
  • 1
    @rjmunro: double quotes are already encoded by default. What the `ENT_QUOTES` flag does is also to encode single quotes, which it's valid to use as attribute value delimiters too (although very few people actually do). It's the safer thing to do, but it's not actually necessary unless you've got single-quoted attributes. – bobince May 29 '11 at 22:20
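The comments above boil down to picking the encoder by output context rather than filtering input. A minimal PHP sketch of that idea follows. The helper names `esc_html`, `esc_url_part`, and `esc_js` are hypothetical wrappers invented for illustration, and the `JSON_HEX_*` flags are one assumed way to keep `json_encode()` output safe inside a `<script>` block; the thread itself does not prescribe them.

```php
<?php
// Hypothetical helper names; each one wraps a single built-in PHP encoder.

// HTML text nodes and quoted attribute values.
function esc_html(string $s): string {
    return htmlspecialchars($s, ENT_QUOTES, 'UTF-8');
}

// One URL component, such as a query-string value or a path segment.
function esc_url_part(string $s): string {
    return rawurlencode($s);
}

// A JavaScript literal written into a <script> block.
// The JSON_HEX_* flags turn <, >, &, ' and " into \uXXXX escapes.
function esc_js($value): string {
    return json_encode($value, JSON_HEX_TAG | JSON_HEX_AMP | JSON_HEX_APOS | JSON_HEX_QUOT);
}

$input = '"><script>alert(1)</script>';

echo '<p>' . esc_html($input) . "</p>\n";
echo '<a href="/search?q=' . esc_url_part($input) . '">search</a>' . "\n";
echo '<script>var q = ' . esc_js($input) . ";</script>\n";
```

The same input ends up encoded three different ways, which is why a single filter applied at input time cannot cover every case.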
4

HTMLPurifier is the undisputed best option for cleansing HTML input, and htmlspecialchars should be applied to anything else.

But XSS vulnerabilities should not be cleaned out, because any such submissions are garbage anyway. Rather make your application bail and write a log entry. The best filter set to achieve XSS detection is in the mod_security core rules.

I'm using an inconspicuous but quite thorough attribute detection here in new input(), see _xss method.

mario
3

Edit: Thank you @mario for pointing that it all depends on the context. There really is no super way to prevent it all on all occasions. You have to adjust accordingly.


Edit: I stand corrected and very appreciative for both @bobince and @Rook's support on this issue. It's pretty much clear to me now that strip_tags will not prevent XSS attacks in any way.

I've scanned all my code prior to answering to see if I was in any way exposed and all is good because of the htmlentities($a, ENT_QUOTES) I've been using mainly to cope with W3C.

That said, I've updated the function below to somewhat mimic the one I use. I still find strip_tags nice to have before htmlentities, so that when a user does try to enter tags they will not pollute the final output. Say the user entered <b>ok!</b>: it's much nicer to show it as ok! than to print out the full htmlentities-converted text.

Thank you both very much for taking the time to reply and explain.


If it's coming from internet user:

// the text should not carry tags in the first place
function clean_up($text) {
    return htmlentities(strip_tags($text), ENT_QUOTES, 'UTF-8');
}

If it's coming from the backoffice... don't.

There are perfectly valid reasons why someone at the company may need JavaScript for this or that page. It's much better to be able to log and blame than to shut down your users.

Frankie
  • 3
    `strip_tags` is not a security measure. It allows all sorts of XSS badness through. There's almost never a good reason to use `strip_tags`. – bobince Oct 20 '10 at 03:03
  • @bobince, you're perfectly correct. I should have revised my function before copy-pasting it. `strip_tags` is pretty effective in removing **ALL XSS** as long as you strip them all out. – Frankie Oct 20 '10 at 03:07
  • 1
    -1 because xss can still get past this. strip_tags() is garbage. The correct answer is `htmlspecialchars($var,ENT_QUOTES);` – rook Oct 20 '10 at 05:34
  • 1
    @Frankie but you don't need tags to exploit xss. http://stackoverflow.com/questions/3762746/todays-xss-onmouseover-exploit-on-twitter-com – rook Oct 20 '10 at 05:52
  • @Rook, @bobince I've updated the question to reflect your comments. Thank you again for taking the time to reply. – Frankie Oct 20 '10 at 11:58
  • 2
    These comments are somewhat misleading. `strip_tags` does strip *all* HTML tags out. It is therefore a valid help against raw HTML injection. `htmlspecialchars` **and** `urlencode` are required *in addition* if received data is to be put verbatim into tag/attribute context. But that's the crux: **it all depends on the context**. `htmlspecialchars` alone is of no help if the target context is RSS, for example, because `&lt;script&gt;` would result in an XSS exploit over there. – mario Oct 20 '10 at 16:29
  • @mario actually browsers automatically do a htmldecode on (some?) requests. Try posting an htmlencoded quote marks and greater than and less than symbols. Also you can use `htmlspecialchars($var,ENT_QUOTES);` to stop all xss, except for *some cases* when it is already in a ` – rook Oct 20 '10 at 18:20
  • @Frankie yep that is the proper method for stopping xss, i gave you a +1. SO is great for learning tricky shit like this isn't it? – rook Oct 20 '10 at 18:22
  • @Rook, what I meant in that particular case (Twitter), an urlencode would have been the better fix. Any double quote gets turned into an %22 or single quote into a %27, or angle brackets into %3C, %3E. Which way you encode input data is obviously irrelevant to browsers in most cases, if they transfuse raw data onto the next URL. That's why I think strip_tags is not useless per se. | Also I fear the original questioner now went away without knowing about `ENT_QUOTES` that you pointed out, without which htmlspecialchars isn't that useful. – mario Oct 20 '10 at 18:36
  • @mario You're right, Twitter was writing a URL to the page. Also I think you're right about the OP, oh well. He was severely misinformed because he was looking for a "library" to do this, talk about overkill. – rook Oct 20 '10 at 18:53
  • @Rook SO is just amazing. The way we can interact, explore, share and "suck less"... is just close to perfection. Thank you once more! – Frankie Oct 20 '10 at 21:50
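To make the points in the comments above concrete, here is a small sketch of input landing in a single-quoted attribute. The payload string is made up for illustration. Note that since PHP 8.1 the default flags of `htmlspecialchars()` already include `ENT_QUOTES`, so `ENT_COMPAT` is passed explicitly here to reproduce the older default that the thread discusses.

```php
<?php
// Illustrative payload: no tags at all, just a quote that ends the attribute.
$payload = "x' onmouseover='alert(1)";

// strip_tags() finds nothing to strip, so the payload comes back unchanged.
$stripped = strip_tags($payload);

// ENT_COMPAT encodes double quotes only (the default before PHP 8.1),
// so the single quotes survive.
$compat = htmlspecialchars($payload, ENT_COMPAT, 'UTF-8');

// ENT_QUOTES encodes single quotes as well.
$quoted = htmlspecialchars($payload, ENT_QUOTES, 'UTF-8');

echo "<input value='" . $stripped . "'>\n"; // attribute is broken out of
echo "<input value='" . $compat . "'>\n";   // attribute is broken out of
echo "<input value='" . $quoted . "'>\n";   // payload stays inside the value
```

Only the third line keeps the payload inside the attribute value, which matches the advice to use `ENT_QUOTES` whenever single-quoted attributes are possible.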
1

I like htmlpurifier fine, but I see how it could be inefficient, since it's fairly large. Also, it's LGPL, and I don't know if that falls under your GPL ban.

grossvogel
1

In addition to zerkms's answer, if you find you need to accept user-submitted HTML (from a WYSIWYG editor, for example), you will need to use an HTML parser to determine what can and can't be submitted.

I use and recommend HTML Purifier.

Note: Don't even try to use regex :)

alex
  • We had just been hacked, and a team of security consultants fixed our apps with regex all over the place, so it's definitely a standard in the industry to use regex – ninja Oct 05 '17 at 07:03
  • @ninja Using regex is fine, but using it to parse HTML for security is not a good idea. – alex Oct 09 '17 at 07:32
  • What do you mean? It's the only way to check the nature of the data being sent from the client... unless you use magic or something, of course. – ninja Nov 27 '17 at 13:25
0

I'm surprised it's not been mentioned here, but I prefer htmLawed to HTML Purifier. It's up-to-date, nicely licensed, very small and really fast.

Synchro