I need to let users enter Markdown content into my web app, which has a Python back end. I don’t want to needlessly restrict their entries (e.g. by disallowing all HTML, which goes against the spirit and spec of Markdown), but obviously I need to prevent cross-site scripting (XSS) attacks.

I can’t be the first one with this problem, but I didn’t see any SO questions with all the keywords “python,” “Markdown,” and “XSS”, so here goes.

What’s a best-practice way to process Markdown and prevent XSS attacks using Python libraries? (Bonus points for supporting PHP Markdown Extra syntax.)

Alan H.
  • "Python back end"? What does this mean, exactly? If you're supporting markdown, all HTML can be quoted with `<pre>`.
    – S.Lott Mar 10 '11 at 21:39
  • You could test your app against the XSS cheat sheet at http://ha.ckers.org/xss.html – jfs Mar 10 '11 at 22:17
  • @S.Lott: Meaning the server-side scripting is in Python. `<pre>` isn’t exactly the solution. Markdown is what we use here on SO to write comments and questions… and the only time it results in a `<pre>` block is when you specifically request a code block (by indenting).
    – Alan H. Mar 10 '11 at 22:59
  • @Alan H. "server-side scripting is in Python"? What does this mean, exactly? How would you allow markdown and somehow, magically, allow some non-escaped HTML? All the frameworks I know of will trivially "escape" the HTML in the content to prevent XSS problems. Since you're asking, you must not be using a framework. Which leads to the questions "Python back end"? "server-side scripting is in Python"? I have no idea what you're talking about, and without details it's very difficult to provide any kind of response. – S.Lott Mar 11 '11 at 00:03
  • @S.Lott No worries, I’m asking this question to people who already know about how Markdown works and what a back-end is. – Alan H. Mar 11 '11 at 01:22
  • @Alan H.: Please define "Python back end" or "server-side scripting is in Python" by providing some name of a software product or API. – S.Lott Mar 11 '11 at 03:03
  • @S.Lott I decline. I want this question to be fairly general and not bound to e.g. Django, App Engine, or Zope, etc. (I assume you don’t want me to *define* “server-side” or “Python back end”, but rather *clarify* which framework I may be using. After all, if you needed those defined, surely you wouldn’t know the answer.) – Alan H. Mar 11 '11 at 05:18
  • @Alan H.: You decline to be specific on the backend. Since the answer depends on what exact quoting facilities the mysterious "backend" has, it devolves to a guessing game. You could think about actually **updating** the question to say that. There are some people who might be able to help but aren't the kind of genius that can decode a question which is intentionally vague. – S.Lott Mar 11 '11 at 10:59

2 Answers

I was unable to determine “best practice,” but generally you have three choices when accepting Markdown input:

  1. Allow HTML within Markdown content (this is how Markdown originally/officially works, but if treated naïvely, this can invite XSS attacks).

  2. Just treat any HTML as plain text, essentially letting your Markdown processor escape the user’s input. Thus <small>…</small> in input will not create small text but rather the literal text “<small>…</small>”.

  3. Throw out all HTML tags within Markdown. This is pretty user-hostile and may choke on text like <3 depending on implementation. This is the approach taken here on Stack Overflow.
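
To make options 2 and 3 concrete, here is a minimal standard-library sketch of what each does to raw HTML in the input. It is only an illustration: the sample string and the regular expression are my own assumptions, and a real Markdown processor would apply escaping as part of rendering rather than on the raw text up front.

```python
import html
import re

# Hypothetical user input mixing a heart emoticon with an HTML tag.
user_input = "I <3 Python and <small>fine print</small>"

# Option 2: escape markup so tags show up as literal text.
escaped = html.escape(user_input, quote=False)

# Option 3, done naively: delete anything that looks like a tag.
# The pattern is deliberately simplistic to show the failure mode.
stripped = re.sub(r"<[^>]*>", "", user_input)

print(escaped)   # I &lt;3 Python and &lt;small&gt;fine print&lt;/small&gt;
print(stripped)  # I fine print
```

Note how the naive stripper treats everything from `<3` up to the next `>` as one tag, so "<3 Python and" disappears along with the markup.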

My question regards case #1, specifically.

Given that, what worked well for me is sending user input through

  1. Markdown for Python, which optionally supports Extra syntax and then through
  2. html5lib’s sanitizer.

I threw a bunch of XSS attack attempts at this combination, and all failed (hurray!); but using benign tags like <strong> worked flawlessly.

This way, you are in effect going with option #1 (as desired) except for potentially dangerous or malformed HTML snippets, which are treated as in option #2.
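
The sanitizing step can be sketched roughly as follows. This is not html5lib's sanitizer; it is a tiny standard-library stand-in, assuming a hypothetical allowlist (`ALLOWED_TAGS`) and hypothetical names (`AllowlistSanitizer`, `sanitize`), meant only to show the idea of passing known-safe tags through and escaping the rest. The `rendered` string stands in for the HTML that the Markdown step would produce. Unlike a real HTML5 parser, this sketch does not repair malformed or unbalanced markup, so I would not rely on it in production.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration only; a real one would be longer
# and would also permit a few vetted attributes.
ALLOWED_TAGS = {"p", "em", "strong", "small", "code", "pre",
                "ul", "ol", "li", "blockquote"}


class AllowlistSanitizer(HTMLParser):
    """Re-emit allowed tags without attributes; escape every other tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.out.append(f"<{tag}>")  # all attributes are dropped
        else:
            # Show the disallowed tag as literal text (option #2 behavior).
            self.out.append(escape(self.get_starttag_text()))

    def handle_startendtag(self, tag, attrs):
        # Treat self-closing tags like ordinary start tags.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.out.append(f"</{tag}>")
        else:
            self.out.append(escape(f"</{tag}>"))

    def handle_data(self, data):
        # Character references were decoded by the parser; re-escape text.
        self.out.append(escape(data))

    # Comments, doctypes, and processing instructions are dropped because
    # the default handlers for them do nothing.


def sanitize(rendered_html):
    parser = AllowlistSanitizer()
    parser.feed(rendered_html)
    parser.close()
    return "".join(parser.out)


# Stand-in for the output of the Markdown step.
rendered = ('<p>Hello <strong onclick="evil()">world</strong>'
            '<script>alert(1)</script></p>')
clean = sanitize(rendered)
print(clean)
```

Here the benign `<strong>` survives (minus its `onclick` attribute), while the `<script>` element is rendered as visible text instead of being executed.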

(Thanks to Y.H Wong for pointing me in the direction of that Markdown library!)

Joe
Alan H.

Markdown in Python is probably what you are looking for. It seems to cover a lot of your requested extensions too.

To prevent XSS attacks, the preferred approach is the same as in other languages: escape user-supplied content when rendering it back. I just took a peek at the documentation and the source code. Markdown seems to be able to do this right out of the box with some trivial config tweaks.

Y.H Wong
  • That’s not right… escaping the output would result in tags showing up on the page. – Alan H. Mar 10 '11 at 22:57
  • "Markdown in Python" sure does look like what I want, though. Thanks for that, much appreciated. – Alan H. Mar 10 '11 at 23:26
  • "escaping the output would result in tags showing up on the page". What else can possibly happen? Either escaping the HTML with `<pre>` tags turns the HTML into simple, formatted text, defeating any scripting going on, or you allow XSS. There's very little middle ground of "allowing" HTML and magically preventing XSS.
    – S.Lott Mar 11 '11 at 00:04
  • S.Lott, your angry ignorance is baffling. (1): I will probably go with [something like html5lib](http://code.google.com/p/html5lib/) which does exactly that. It’s called sanitizing. Have you never seen e.g. a blog which allowed HTML comments but only certain tags and few or no attributes? It’s certainly possible. (2): `<pre>` tags don’t actually escape anything or prevent XSS attacks, you realize, right? Consider `` as input: *pwned,* as they say.
    – Alan H. Mar 11 '11 at 01:18
  • No shouting children. By escaping the output, I mean only directly inserted HTML code (HTML in Markdown markup) would be escaped. The actual output that should be HTML (Markdown's target) will still be in HTML. Just try it out and you'll see. – Y.H Wong Mar 11 '11 at 02:07
  • BTW, you should probably check whether Markdown for Python lets you define a list of safe HTML tags. – Y.H Wong Mar 11 '11 at 02:11
  • Thanks, Y.H. Markdown for Python does support what you are suggesting: `safe_mode="escape"`, per the docs, would turn tags in the source into escaped, visible characters in the output. This is, however, not how Markdown originally worked (and works); you should be able to use `` and `` and other tags if you want to. Using a sanitizer, you can, “safely”. – Alan H. Mar 11 '11 at 05:10
  • Or you can run the Markdown output through a sanitizer before sticking that into the final HTML output. It's up to you. – Y.H Wong Mar 14 '11 at 04:11