6

I'm coding a WYSIWYG editor with designMode="on" on an iframe. The editor works fine, and I store the code as-is in the database.

Before outputting the HTML, I need to "clean" it with PHP on the server side to avoid cross-site scripting and other scary things. Is there some sort of best practice on how to do this? What tags can be dangerous?

UPDATE: Typo fixed, it's What You See Is What You Get. Nothing new :)

Martin

4 Answers

5

The best practice is to allow only certain things you know aren't dangerous, and remove/escape all the rest. See the paper Automated Malicious Code Detection and Removal on the Web (OWASP AntiSamy) for a discussion of this (the library is for Java, but the principles apply to any language).
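
A minimal sketch of the whitelist idea in PHP, using the built-in strip_tags(). Note this is illustrative only: strip_tags() keeps the attributes of allowed tags, so things like onclick="..." or href="javascript:..." survive it. A maintained library that also filters attributes is the safer choice in practice.

```php
<?php
// Naive whitelist: keep a small set of formatting tags, drop
// everything else. strip_tags() removes disallowed tags but keeps
// their text content, and does NOT strip attributes on allowed
// tags -- so this is a sketch, not a complete XSS defense.
function naive_whitelist(string $html): string
{
    $allowed = '<b><i><em><strong><p><br>';
    return strip_tags($html, $allowed);
}

echo naive_whitelist('<b>bold</b><script>alert(1)</script>');
// <b>bold</b>alert(1)
```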

Chris Lercher
  • I started out that way, but since all browsers implement this stuff differently, I will get a lot of tags for the same thing that I need to allow. For example, bold text is done in at least 3 different ways, so it will be a huge set of regexes. It's also possible to paste whatever formatted HTML you want into the editor, like from an HTML mail or something. That looks good in the editor but won't work after escaping. – Martin May 05 '10 at 14:38
  • 1
    That's why AntiSamy already comes with some example sets. Probably, there's also a PHP library (or you can create one?) You will *never* achieve it the other way around (by blacklisting): Everyone who tried this before, has failed - it's simply not realistically possible - there *will* be something you haven't covered (which is fatal for blacklisting, but doesn't matter too much when whitelisting). Ideally, if you can avoid HTML, use Markdown etc., as suggested by Hank! – Chris Lercher May 05 '10 at 14:41
  • 1
    @Martin you *REALLY* shouldn't be using regexes for this. There's a reason [this answer](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) got (net) 3000 upvotes. – Hank Gay May 05 '10 at 15:56
  • Okay, I'm convinced now that I should do whitelisting instead of blacklisting. @Hank Gay: But I'm not really going to parse HTML; I'm just going to replace < with &lt; and then replace &lt; back to < for a small set of known patterns. Is that still like going on a date with Satan? – Martin May 06 '10 at 07:24
3

If you're really bent on allowing this, you should use a whitelist approach.

The best approach is probably to disallow HTML and use a simplified markup format instead; you can pre-render to HTML and store that in the database if performance is a concern. Avoiding these sorts of problems is one of the big reasons for using Markdown, Textile, reStructuredText, etc.

NOTE: I linked to GitHub-Flavored Markdown (GFM), not Standard Markdown (SM). GFM addresses some common problems that end-users have with SM.

Hank Gay
1

I looked into the same question recently with Perl as the server-side language.

While doing so, I ran into HTML Purifier, which may be what you want. But obviously, as it's in PHP and not Perl, I didn't actually test it out.

Also, in my research I came to the conclusion that this is a very tricky business, so consider, if at all possible, using a simplified markup language like Markdown, as suggested by Hank Gay.

FalseVinylShrub
0

If you are familiar with ASP.NET, just perform a Server.HtmlEncode() to convert special characters like < and > into &lt; and &gt;.

In PHP, you can use the htmlspecialchars() function.

Once the special characters are encoded, cross-site-scripting can be prevented.
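
For example (passing ENT_QUOTES explicitly so single quotes are encoded as well, which matters when the output ends up inside an attribute value):

```php
<?php
// Encode all HTML metacharacters so user input renders as plain
// text, never as markup. ENT_QUOTES also converts single quotes.
$input = '<script>alert("xss")</script>';
echo htmlspecialchars($input, ENT_QUOTES, 'UTF-8');
// &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;
```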

TechTravelThink
  • But that disables HTML entirely; I want to allow HTML but remove dangerous tags like iframe and script. – Martin May 05 '10 at 14:34
  • Then use a markup language specifically designed for the purpose, like BBCode or wiki code, and a suitable editor. – symcbean May 05 '10 at 16:39