How to prevent bleach from escaping > (blockquote) tag used in Markdown

Question

I'm using bleach to sanitize user input. But I use Markdown, which means I need the blockquote > symbol to go through without being escaped as &gt; so I can pass it to misaka for rendering.

The documentation says by default it escapes html markup but doesn't say how to turn that off for the > symbol. I would still like it to escape actual html tags.

http://bleach.readthedocs.org/en/latest/clean.html

Any other ideas for sanitizing input while maintaining the ability to use Markdown would be appreciated.

score 2 · Answer 1 · edited May 23 '17 at 11:52

Bleach is an HTML sanitizer, not a Markdown sanitizer. As explained here, you should run your user input through Markdown first, then through Bleach. Like this:

sanitized_html = bleach.clean(markdown.markdown(some_text))

For more info, see my previously referenced comment.

edited May 23 '17 at 11:52

Community


answered Feb 21 '14 at 17:12

Waylan

most people [here](http://stackoverflow.com/questions/1266650/should-i-sanitize-markdown) say you need to sanitize markdown before you save it in the database. like the OP in that thread, i have a field in the database for both the original markdown and the generated html. – aris Feb 22 '14 at 01:23
@aris that is not how I read those comments. In fact, they state that you should sanitize "before sending to web clients". I don't see anyone specifically saying you should sanitize before saving to the DB (even though that was the OP's question). Either way, Bleach is not a Markdown sanitizer, so it is the wrong tool for the job. – Waylan Mar 14 '14 at 00:17

score 0 · Accepted Answer · edited Mar 16 '15 at 20:29

Do you need to strip all tags but leave > as it is?

1. Strip all tags and get the output.
2. HTML-decode the output of step 1 and pass that data to misaka.

Simple way for step 2:

output.replace('&gt;', '>')

More professional

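try:
    from html import unescape  # Python 3.4+
except ImportError:
    from HTMLParser import HTMLParser  # Python 2
    unescape = HTMLParser().unescape
s = unescape(sanitized_user_input)  # sanitized_user_input: the output of step 1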
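
To make step 2 concrete, here is a minimal sketch using only the Python 3 standard library. The sample strings are hypothetical stand-ins for sanitizer output, not actual bleach results. It also shows a caveat worth keeping in mind: unescaping decodes every entity, so a tag that was escaped rather than stripped comes back as live markup.

```python
import html

# Hypothetical sanitizer output: tags already stripped, stray ">" and "&" escaped.
sanitized = "&gt; quoted line\nplain text &amp; more"

# Decode all HTML entities so the Markdown renderer sees a real ">" again.
restored = html.unescape(sanitized)
print(restored)

# Caveat: entities that encode a tag are decoded as well.
escaped_tag = "&lt;script&gt;alert(1)&lt;/script&gt;"
reintroduced = html.unescape(escaped_tag)
print(reintroduced)
```

Because of that caveat, decoding everything is only reasonable when step 1 really removed the tags instead of escaping them.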

edited Mar 16 '15 at 20:29

Taymon

answered Feb 21 '14 at 08:00

pinkdawn

thanks! the simple way works fine as a quick solution but it does match all &gt; symbols, including at the end of html tags. is it possible to get it to match just the &gt; occurring after a new line? i wasn't able to do replace('\n&gt;','>') – aris Feb 21 '14 at 08:32
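
One possible way to do what this comment asks is a multiline regular expression that only touches an escaped > at the start of a line. This is a sketch, not something from the answers above: the helper name `restore_blockquotes` and the sample text are made up for illustration, and the pattern assumes blockquote markers are preceded by at most three spaces. A plain replace on '\n&gt;' would miss a quote on the very first line, which the `^` anchor with `re.MULTILINE` handles.

```python
import re

# A run of escaped blockquote markers at the start of a line, with up to
# three leading spaces and optional nesting such as "&gt; &gt; text".
_QUOTE_PREFIX = re.compile(r"^[ ]{0,3}(?:&gt;[ ]?)+", re.MULTILINE)


def restore_blockquotes(text: str) -> str:
    """Turn leading '&gt;' back into '>' and leave every other '&gt;' escaped."""
    return _QUOTE_PREFIX.sub(lambda m: m.group(0).replace("&gt;", ">"), text)


# Hypothetical escaped input covering the interesting cases.
sample = (
    "&gt; quoted\n"
    "&gt; &gt; nested\n"
    "1 &gt; 0 stays escaped\n"
    "&lt;b&gt;bold&lt;/b&gt;"
)
result = restore_blockquotes(sample)
print(result)
```

Escaped tags such as &lt;b&gt; are left alone, so only the Markdown blockquote markers reach the renderer as real > characters.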
