177

cgi.escape seems like one possible choice. Does it work well? Is there something that is considered better?

Josh Gibson

9 Answers

202

html.escape is the correct answer now; before Python 3.2 it was cgi.escape. It escapes:

  • < to &lt;
  • > to &gt;
  • & to &amp;

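That is enough for text placed in HTML element content; attribute values also need their quotes escaped (see the quote parameter below).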

EDIT: If you also want to escape non-ASCII characters, for inclusion in a document that uses a different encoding, as Craig says, just use:

data.encode('ascii', 'xmlcharrefreplace')

Don't forget to decode data to unicode first, using whatever encoding it was encoded in.

However, in my experience that kind of encoding is unnecessary if you work with unicode throughout, right from the start. Just encode at the end to the encoding specified in the document header (utf-8 for maximum compatibility).

Example:

>>> cgi.escape(u'<a>bá</a>').encode('ascii', 'xmlcharrefreplace')
'&lt;a&gt;b&#225;&lt;/a&gt;'

Also worth noting (thanks Greg) is the extra quote parameter that cgi.escape takes. With it set to True, cgi.escape also escapes double quote chars (") so you can use the resulting value in an XML/HTML attribute.

EDIT: Note that cgi.escape has been deprecated in Python 3.2 in favor of html.escape, which does the same except that quote defaults to True.
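
On Python 3 the same two steps can be sketched with html.escape. This is a minimal sketch assuming Python 3.2 or later; the sample string and the variable names are illustrative and not taken from the answer above.

```python
import html

# Illustrative input: markup characters, a double quote, and a non-ASCII letter.
text = '<a href="x">bá</a>'

# Step 1: escape &, <, > and (by default) both quote characters.
escaped = html.escape(text)

# Step 2 (optional): turn non-ASCII characters into numeric character
# references so the result is pure ASCII. In Python 3 this returns bytes.
ascii_bytes = escaped.encode('ascii', 'xmlcharrefreplace')

print(escaped)
print(ascii_bytes)
```

Step 2 is only needed when the target document cannot carry the characters directly; if the page is served as UTF-8, step 1 alone is usually sufficient.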

nosklo
  • The additional boolean parameter to cgi.escape should also be considered for escaping quotes when text is used in HTML attribute values. – Greg Hewgill Jun 30 '09 at 04:20
  • Just to be sure: If I run all untrusted data through the `cgi.escape` function, is that enough to protect against all (known) XSS attacks? – Tomas Sedovic Feb 11 '10 at 21:41
  • @Tomas Sedovic: Depends on where you'll put the text after running cgi.escape in it. If placed in root HTML context then yes, you're completely safe. – nosklo Feb 12 '10 at 03:00
  • What about input like {{Measures 12 Ω"H x 17 5/8"W x 8 7/8"D. Imported.}} That's not ascii, so encode() will throw an exception at you. – Andrew Kolesnikov Jun 22 '10 at 15:56
  • @Andrew Kolesnikov: Have you tried it? `cgi.escape(yourunicodeobj).encode('ascii', 'xmlcharrefreplace') == '{{Measures 12 &#937;"H x 17 5/8"W x 8 7/8"D. Imported.}}'` -- as you can see, the expression returns an ASCII bytestring, with all non-ASCII unicode chars encoded as XML character references. – nosklo Jun 23 '10 at 03:48
  • Actually it would seem you need to do cgi.escape(yourunicode).decode('utf-8').encode('ascii', 'xmlcharrefreplace'), otherwise the ascii codec doesn't know how to handle the Ω. – Adrian Ghizaru Jan 05 '12 at 17:19
  • @AdrianGhizaru well, no. First of all, you're trying to .decode `yourunicode`, which, since you claim it is unicode, would be **already decoded**. That would invoke *implicit ASCII encoding*, or fail directly, depending on the Python version. If you just use the example provided at the end of the answer, `cgi.escape(u'Ω').encode('ascii', 'xmlcharrefreplace')`, it will already work. So I guess that if you got an error, then `yourunicode` is not really unicode, and you'd need to decode it first to get unicode. – nosklo Jan 10 '12 at 01:35
  • How do I escape several whitespaces and line breaks? – user937284 Feb 19 '15 at 17:26
  • Can you explain the decode/encode process in full? I get the error message "UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 76: invalid start byte" when I try `text.decode('utf-8').encode('ascii', 'xmlcharrefreplace') `.. trying decode('Unicode') throws unknown encoding: Unicode – 576i Aug 30 '15 at 10:41
  • For anyone landing here trying to avoid XSS vulnerabilities, simple cgi escaping is definitely NOT enough to protect you, contrary to some earlier/ancient comments. See [this tricky case](https://www.owasp.org/index.php/XSS_Filter_Evasion_Cheat_Sheet#Escaping_JavaScript_escapes) for example. Note that `html.escape` will protect you from that vector, though (since it gets both quote types)... provided you use it everywhere. – Russ Mar 31 '18 at 03:12
  • NOTE: `cgi.escape` was removed in Python 3.8. – 0x5453 Feb 18 '22 at 14:50
171

In Python 3.2 a new html module was introduced, which is used for escaping reserved characters from HTML markup.

It provides an escape() function:

>>> import html
>>> html.escape('x > 2 && x < 7 single quote: \' double quote: "')
'x &gt; 2 &amp;&amp; x &lt; 7 single quote: &#x27; double quote: &quot;'
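
The effect of the quote parameter can be sketched as follows. This is a small illustration assuming Python 3.4 or later (for html.unescape); the sample value and variable names are made up for the example.

```python
import html

# Illustrative value containing both quote characters and a tag.
value = 'Tom\'s "quoted" <tag>'

with_quotes = html.escape(value)                   # default: quote=True
without_quotes = html.escape(value, quote=False)   # only &, <, > are escaped

print(with_quotes)
print(without_quotes)

# html.unescape reverses the escaping, so a round trip returns the input.
round_trip = html.unescape(with_quotes)
print(round_trip == value)
```

Keeping the default quote=True is the safer choice whenever the result may end up inside an attribute value.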
Flimm
Maciej Ziarko
  • What about `quote=True`? – 2rs2ts Nov 14 '13 at 23:49
  • @SalmanAbbas Are you afraid that quotes aren't escaped? Note that `html.escape()` does escape quotes by default (in contrast, `cgi.escape()` does not, and only escapes double quotes if told to). Thus, I have to explicitly set an optional parameter to inject something into an attribute with `html.escape()`, i.e. to make it insecure for attributes: `t = '" onclick="alert()'; t = html.escape(t, quote=False); s = f'foo'` – maxschlepzig Apr 26 '19 at 09:32
  • @maxschlepzig I think Salman is saying `escape()` is not enough to make attributes safe. In other words, this is not safe: `` – pianoJames Jul 30 '19 at 15:49
  • @pianoJames, I see. I consider checking link values a domain-specific semantic validation, not a lexical one like escaping. Besides inline JavaScript, you really don't want to create links from untrusted user input without further URL-specific validation (e.g. because of spammers). A simple method to protect against inline JavaScript in attributes like `href` is to set a Content Security Policy that disallows it. – maxschlepzig Jul 31 '19 at 19:41
  • @pianoJames It is safe, because `html.escape` does escape single quotes and double quotes. – Flimm May 07 '20 at 19:16
12

If you wish to escape HTML in a URL:

This is probably NOT what the OP wanted (the question doesn't clearly indicate in which context the escaping is meant to be used), but Python's standard library urllib has a function that percent-encodes characters so they can be included in a URL safely.

The following is an example:

#!/usr/bin/env python3
from urllib.parse import quote  # Python 2: from urllib import quote

x = '+<>^&'
print(quote(x))  # prints %2B%3C%3E%5E%26

Find docs here
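
When a URL built this way is then written into an HTML page, both kinds of escaping apply, each at its own layer. The following is a hedged sketch assuming Python 3; the example.com address, the query value, and the variable names are placeholders invented for illustration.

```python
import html
from urllib.parse import quote

# Illustrative user-supplied search term.
query = 'a&b <c>'

# URL layer: percent-encode the value before placing it in the query string.
# safe='' means no characters are exempt from encoding.
url = 'https://example.com/search?q=' + quote(query, safe='')

# HTML layer: escape both the attribute value and the visible link text.
link = '<a href="{}">{}</a>'.format(html.escape(url), html.escape(query))

print(url)
print(link)
```

This only handles encoding; it does not validate where a user-supplied link points.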

vallentin
SuperFamousGuy
  • This is the wrong kind of escaping; we're looking for [HTML escapes](http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references), as opposed to [URL encoding](http://en.wikipedia.org/wiki/URL_Encoding). – Chaosphere2112 Sep 12 '13 at 21:56
  • Nonetheless - it was what I was actually looking for ;-) – Brad Jan 16 '15 at 13:38
  • In Python 3, this has been moved to urllib.parse.quote. https://docs.python.org/3/library/urllib.parse.html#url-quoting – Mark Peschel Nov 29 '21 at 20:34
9

There is also the excellent markupsafe package.

>>> from markupsafe import Markup, escape
>>> escape("<script>alert(document.cookie);</script>")
Markup(u'&lt;script&gt;alert(document.cookie);&lt;/script&gt;')

The markupsafe package is well engineered, and probably the most versatile and Pythonic way to go about escaping, IMHO, because:

  1. the return value (Markup) is an instance of a class derived from unicode (i.e. isinstance(escape('str'), unicode) == True)
  2. it properly handles unicode input
  3. it works in Python (2.6, 2.7, 3.3, and pypy)
  4. it respects custom methods of objects (i.e. objects with a __html__ property) and template overloads (__html_format__).
Brian M. Hunt
7

cgi.escape should be good to escape HTML in the limited sense of escaping the HTML tags and character entities.

But you might have to also consider encoding issues: if the HTML you want to quote has non-ASCII characters in a particular encoding, then you would also have to take care that you represent those sensibly when quoting. Perhaps you could convert them to entities. Otherwise you should ensure that the correct encoding translations are done between the "source" HTML and the page it's embedded in, to avoid corrupting the non-ASCII characters.

Craig McQueen
6

No libraries, pure python, safely escapes text into html text:

text.replace('&', '&amp;').replace('>', '&gt;').replace('<', '&lt;'
        ).replace('\'','&#39;').replace('"','&#34;').encode('ascii', 'xmlcharrefreplace')
Arhacker T
speedplane
2

Not the easiest way, but still straightforward. The main difference from cgi.escape is that this version still works properly if your text already contains &amp;. For comparison, here are both versions (note the comments in the cgi.escape source):

  • cgi.escape version
def escape(s, quote=None):
    '''Replace special characters "&", "<" and ">" to HTML-safe sequences.
    If the optional flag quote is true, the quotation mark character (")
    is also translated.'''
    s = s.replace("&", "&amp;") # Must be done first!
    s = s.replace("<", "&lt;")
    s = s.replace(">", "&gt;")
    if quote:
        s = s.replace('"', "&quot;")
    return s
  • regex version
import re

QUOTE_PATTERN = r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)"""
def escape(word):
    """
    Replaces special characters <>&"' to HTML-safe sequences. 
    With attention to already escaped characters.
    """
    replace_with = {
        '<': '&lt;',
        '>': '&gt;',
        '&': '&amp;',
        '"': '&quot;', # should be escaped in attributes
        "'": '&#39;'   # should be escaped in attributes
    }
    quote_pattern = re.compile(QUOTE_PATTERN)
    return re.sub(quote_pattern, lambda x: replace_with[x.group(0)], word)
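
The difference in behaviour on already-escaped input can be sketched like this. The helper name escape_once and the sample string are hypothetical; the pattern is the same negative-lookahead idea as the regex version above, compared against html.escape from the standard library (Python 3).

```python
import html
import re

# Match a special character unless it already starts one of the listed entities.
_PATTERN = re.compile(r"""([&<>"'])(?!(amp|lt|gt|quot|#39);)""")
_REPLACEMENTS = {
    '<': '&lt;',
    '>': '&gt;',
    '&': '&amp;',
    '"': '&quot;',
    "'": '&#39;',
}

def escape_once(text):
    """Escape special characters, leaving existing entities untouched."""
    return _PATTERN.sub(lambda m: _REPLACEMENTS[m.group(1)], text)

already = 'Tom &amp; Jerry <b>'
print(html.escape(already))   # the existing &amp; gets escaped a second time
print(escape_once(already))   # the existing &amp; is left alone
```

The trade-off is that a user who literally typed `&lt;` will see `<` rendered instead of the text they typed, so this approach fits best when the input is known to be partially escaped already.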
Zhymabek Roman
palestamp
1

cgi.escape extended

This version improves cgi.escape. It also preserves whitespace and newlines. Returns a unicode string.

import cgi  # cgi.escape exists only before Python 3.8; later versions need html.escape

def escape_html(text):
    """escape strings for display in HTML"""
    return cgi.escape(text, quote=True).\
           replace(u'\n', u'<br />').\
           replace(u'\t', u'&emsp;').\
           replace(u'  ', u' &nbsp;')

for example

>>> escape_html('<foo>\nfoo\t"bar"')
u'&lt;foo&gt;<br />foo&emsp;&quot;bar&quot;'
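
Since cgi.escape was removed in Python 3.8, the same idea can be sketched on current Python with html.escape. This is an adaptation under that assumption, not the answer author's code; note that html.escape additionally escapes single quotes.

```python
import html

def escape_html(text):
    """Escape a string for display in HTML, keeping line breaks, tabs and runs of spaces visible."""
    return (
        html.escape(text, quote=True)
        .replace('\n', '<br />')   # newline becomes a line break element
        .replace('\t', '&emsp;')   # tab becomes an em space
        .replace('  ', ' &nbsp;')  # every pair of spaces keeps one non-breaking space
    )

result = escape_html('<foo>\nfoo\t"bar"')
print(result)
```

The escaping must run before the replacements, otherwise the inserted `<br />` tags would themselves be escaped.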
JamesThomasMoon
1

For legacy code in Python 2.7, you can do it via BeautifulSoup4:

>>> from bs4.dammit import EntitySubstitution
>>> esub = EntitySubstitution()
>>> esub.substitute_html("r&d")
'r&amp;d'
scharfmn