3

I work on a web application that uses Markdown as its input syntax. The only issue I am facing is how to validate user input on the server side, so that it is actually Markdown and not an XSS attack injected through a crafted POST request or by disabling JavaScript in the browser.

I know Stack Overflow does this, but how do they do it while still allowing certain HTML tags, including images, that are prone to XSS attacks? Is there an open source package that can help? Examples are appreciated.

Because I heard that Stack Overflow uses it, I will be trying out Pagedown as the client-side validator.

hakre
user115422
  • @zerkms that converts Markdown to HTML, I need the inverse, I need it to check if the input is indeed Markdown... – user115422 Dec 22 '12 at 00:29
  • @fermionoid: Continue your research, you only just started. Follow the Stack Overflow trail, Jeff has some info here: [Programming Is Hard, Let's Go Shopping!](http://www.codinghorror.com/blog/2008/10/programming-is-hard-lets-go-shopping.html) – hakre Dec 22 '12 at 00:33
  • @hakre are there any better markup languages? I've heard that BBCode isn't secure and I don't really feel like inventing my own :) – user115422 Dec 22 '12 at 00:34
  • @fermionoid: nothing is secure by definition. The opposite is true as well. How well a solution is protected depends on its implementation. – zerkms Dec 22 '12 at 00:35
  • @zerkms, OK, what do you think of Quentin's answer? Is my interpretation correct that it means stripping all tags except those on the whitelist? – user115422 Dec 22 '12 at 00:37

2 Answers

3

You need to invest roughly one to two weeks of proper coding to finish a tag-soup parser/handler that can sanitize the incoming HTML (via Markdown).

I highly suggest a three-pass validation and processing scheme:

  1. Mixed mode: Whitelist the incoming HTML tags that are embedded in the Markdown document.
  2. Markdown parser: Transform the Markdown into HTML.
  3. HTML mode: Whitelist the HTML tags in the resulting HTML document.

You can then output the result. Store both the Markdown source and the "baked" HTML so that you don't need to repeat this for every display operation.
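The three passes and the "store both" idea can be sketched roughly as follows. This is only an outline in Python, not a finished implementation: `process_submission`, `StoredPost` and the three `toy_*` functions are hypothetical names invented for this illustration, and the toy stand-ins exist only so the sketch runs. They are not safe or complete; in practice each pass would be backed by a real parser or sanitizer.

```python
import re
from dataclasses import dataclass
from typing import Callable

Transform = Callable[[str], str]


@dataclass(frozen=True)
class StoredPost:
    markdown_source: str  # kept so the author can edit the post later
    rendered_html: str    # sanitized output, served on every display


def process_submission(raw: str,
                       filter_inline_html: Transform,
                       render_markdown: Transform,
                       sanitize_html: Transform) -> StoredPost:
    """Run the three passes once and return both versions for storage."""
    pre_filtered = filter_inline_html(raw)    # pass 1: mixed mode
    rendered = render_markdown(pre_filtered)  # pass 2: Markdown -> HTML
    safe = sanitize_html(rendered)            # pass 3: HTML mode
    return StoredPost(markdown_source=raw, rendered_html=safe)


# --- Toy stand-ins (assumptions for the demo, NOT production code) ---

def toy_filter(text: str) -> str:
    # Pass 1 stand-in: drops <script> blocks from the mixed input.
    return re.sub(r"(?is)<script\b.*?</script\s*>", "", text)


def toy_render(text: str) -> str:
    # Pass 2 stand-in: understands only **bold** and a single paragraph.
    body = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    return f"<p>{body}</p>"


def toy_sanitize(markup: str) -> str:
    # Pass 3 stand-in: strips every tag except <p> and <strong>.
    return re.sub(r"</?(?!(?:p|strong)\b)[a-zA-Z][^>]*>", "", markup)


post = process_submission("**hi** <script>alert(1)</script><i>x</i>",
                          toy_filter, toy_render, toy_sanitize)
print(post.rendered_html)
```

Passing the three steps in as callables keeps the order of the passes fixed while letting you swap in whichever Markdown parser and HTML sanitizer you end up using on the server.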

hakre
  • OK, so how will using htmlpurifier work to remove the illegal attributes etc.? – user115422 Dec 22 '12 at 00:40
  • You could use Htmlpurifier for step 3. It already does all of that well, IIRC. It checks for these "illegal" attributes and the "etc." part. The source is open, btw. – hakre Dec 22 '12 at 00:42
  • Right, so I have #1 as Pagedown, right? I also have a package that converts the Markdown to HTML, then I'll use strip_tags and htmlpurifier for #3. But is the list Quentin posted appropriate, or are there any other tags I can add or remove from it? – user115422 Dec 22 '12 at 00:45
  • Well, I don't know Pagedown, but it probably follows a similar three-step pattern as well, just on the client side. My list here was general; you can implement it on the client *and* on the server. So on the server, step 1 needs to be run as well; you cannot just pretend that step 1 has been run on the client side. The whole process needs to be run, in three steps, on each side. So #1 would already have removed all invalid HTML tags from the input *before* the non-HTML Markdown parts are turned into HTML in #2. #3 only takes care that #2 didn't let exploits slip in. – hakre Dec 22 '12 at 00:51
  • Right, then this brings me back to the issue of how to convert HTML into markup using a PHP script... – user115422 Dec 22 '12 at 00:53
  • @fermionoid: Maybe then this is a starting point for you: http://stackoverflow.com/a/3577662/367456 – hakre Dec 22 '12 at 00:56
  • Thanks, I'll get started and try to come up with a working model. I'll get back to you to check if I'm doing it properly. Thanks! – user115422 Dec 22 '12 at 19:40
2

Markdown allows arbitrary HTML to be included in it. Since this includes <script> elements, you can have valid Markdown that is also an XSS attack.

Run the incoming data through a Markdown parser to get HTML, then treat it like any other user submitted HTML (pass it through an HTML parser that applies a whitelist to the elements and attributes).
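To illustrate the whitelist step, here is a minimal sketch in Python that uses only the standard library's `html.parser`. The whitelist contents, the allowed URL schemes and the names `WhitelistSanitizer`, `sanitize_html` and `is_safe_url` are assumptions made up for this example, not a recommendation of which tags to allow. A maintained sanitizer library will cover far more edge cases than this does.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> attributes allowed on it.
ALLOWED_TAGS = {
    "a": {"href", "title"},
    "img": {"src", "alt", "title"},
    "p": set(), "em": set(), "strong": set(), "code": set(), "pre": set(),
    "ul": set(), "ol": set(), "li": set(), "blockquote": set(), "br": set(),
}
URL_ATTRS = {"href", "src"}
ALLOWED_SCHEMES = {"http", "https", "mailto"}  # assumption for this sketch
VOID_TAGS = {"img", "br"}                      # no closing tag is emitted
DROP_CONTENT = {"script", "style"}             # drop the tag AND its text


def is_safe_url(url):
    # Remove whitespace/control characters that can hide a scheme.
    cleaned = "".join(ch for ch in url if ch > " ").lower()
    scheme, colon, _ = cleaned.partition(":")
    if not colon or any(ch in scheme for ch in "/?#"):
        return True  # no scheme part, so this is a relative URL
    return scheme in ALLOWED_SCHEMES


class WhitelistSanitizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in ALLOWED_TAGS:
            return  # unknown element: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            value = value or ""
            if name not in ALLOWED_TAGS[tag]:
                continue  # e.g. onclick, onerror, style
            if name in URL_ATTRS and not is_safe_url(value):
                continue  # e.g. javascript: URLs
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in ALLOWED_TAGS or tag in VOID_TAGS:
            return
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def sanitize_html(markup):
    parser = WhitelistSanitizer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


dirty = ('<p onclick="evil()">Hi <script>alert(1)</script>'
         '<a href="javascript:alert(1)" title="t">x</a>'
         '<img src="http://example.com/a.png" onerror="x()"></p>')
print(sanitize_html(dirty))
# prints: <p>Hi <a title="t">x</a><img src="http://example.com/a.png"></p>
```

The point of the design is that the output is rebuilt from parsed tokens rather than by deleting "bad" substrings from the input, so anything not explicitly on the whitelist never reaches the page.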

Community
Quentin