13

I need to do some very quick-n-dirty input sanitizing and I would like to basically convert all <, > to &lt;, &gt;.

I'd like to achieve the same result as '<script></script>'.replace('<', '&lt;').replace('>', '&gt;') without iterating over the string multiple times. I know about maketrans in conjunction with str.translate (e.g. http://www.tutorialspoint.com/python/string_translate.htm), but this only maps one character to another single character. In other words, one cannot do something like:

inList = '<>'
outList = ['&lt;', '&gt;']
transform = maketrans(inList, outList)

Is there a builtin function that can do this conversion in a single iteration?

I'd like to use builtin capabilities as opposed to external modules. I already know about Bleach.
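
For reference, in Python 3 the built-in str.maketrans accepts a dict whose values may be multi-character strings, so the mapping sketched above can be expressed without external modules. This is a minimal sketch assuming Python 3; the helper name escape_angle_brackets is hypothetical and not part of any library.

```python
# Python 3 only: str.maketrans accepts a dict mapping single characters
# to replacement strings of any length.
TABLE = str.maketrans({'<': '&lt;', '>': '&gt;'})


def escape_angle_brackets(text):
    # Hypothetical helper name; a single translate() call handles both characters.
    return text.translate(TABLE)


print(escape_angle_brackets('<script></script>'))
# &lt;script&gt;&lt;/script&gt;
```

Note that this only covers < and >; it does not escape & or quotes.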

Nathan Davis
notorious.no
  • Why not just iterate by hand? – Kevin Aug 17 '15 at 16:09
  • In that case it seems you actually want to particularly encode characters in HTML, please check http://stackoverflow.com/questions/701704/convert-html-entities-to-unicode-and-vice-versa – Nicolas78 Aug 17 '15 at 16:11
  • See https://stackoverflow.com/questions/6116978/python-replace-multiple-strings for multiple string replacement in general. – augurar Aug 17 '15 at 16:16

3 Answers

14

Use html.escape(). cgi.escape() was deprecated in Python 3.2 and removed in Python 3.8.

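import html

text = '<>&'  # renamed from "input" to avoid shadowing the built-in input()
escaped = html.escape(text)  # escapes &, < and > (and quotes by default)
print(escaped)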

&lt;&gt;&amp;
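
One detail worth knowing: html.escape() takes an optional quote parameter that defaults to True, in which case double and single quotes are escaped as well. The sketch below compares both settings; the sample string and variable names are made up for illustration.

```python
import html

# Illustrative input, not from the original post.
sample = '<a href="x">Tom & Jerry\'s</a>'

# Default (quote=True): " and ' are escaped along with &, < and >.
escaped_all = html.escape(sample)
print(escaped_all)
# &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#x27;s&lt;/a&gt;

# quote=False: only &, < and > are escaped.
escaped_no_quotes = html.escape(sample, quote=False)
print(escaped_no_quotes)
# &lt;a href="x"&gt;Tom &amp; Jerry's&lt;/a&gt;
```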
Michael Dubin
12

In Python 2, you can use cgi.escape() (it was deprecated in Python 3.2 and removed in Python 3.8):

import cgi
inlist = '<>'
transform = cgi.escape(inlist)  # escapes &, < and >
print(transform)

Output:

&lt;&gt;

https://docs.python.org/2/library/cgi.html#cgi.escape

cgi.escape(s[, quote])

Convert the characters '&', '<' and '>' in string s to HTML-safe sequences. Use this if you need to display text that might contain such characters in HTML. If the optional flag quote is true, the quotation mark character (") is also translated; this helps for inclusion in an HTML attribute value delimited by double quotes, as in <a href="...">. Note that single quotes are never translated.
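
The behaviour described in that documentation excerpt can be approximated in a few lines on Python versions where cgi.escape() is no longer available. This is only a sketch of the documented behaviour, not the library's code; the name escape_like_cgi is hypothetical.

```python
def escape_like_cgi(s, quote=False):
    # Hypothetical stand-in that mirrors the documented behaviour of cgi.escape().
    # '&' must be replaced first, otherwise the ampersands introduced by the
    # later replacements would be escaped a second time.
    s = s.replace('&', '&amp;')
    s = s.replace('<', '&lt;')
    s = s.replace('>', '&gt;')
    if quote:
        # Only double quotes are translated; single quotes are left alone.
        s = s.replace('"', '&quot;')
    return s


print(escape_like_cgi('<>'))
# &lt;&gt;
```

This makes several passes over the string, so it illustrates the semantics rather than the single-pass requirement from the question.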

Joe Young
  • 2
    As mentioned in other comments, this method has been deprecated since Python 3.2 (https://docs.python.org/3.7/library/cgi.html#cgi.escape). The documentation suggests using html.escape instead. – Cheche Nov 27 '19 at 12:01
3

You can define your own function that loops over the string once and replaces any characters you define.

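def sanitize(input_string):
    """Escape '<' and '>' in a single pass over input_string."""
    output_string = ''
    for i in input_string:
        # Pick the replacement for this character, or keep it unchanged.
        if i == '>':
            outchar = '&gt;'
        elif i == '<':
            outchar = '&lt;'
        else:
            outchar = i
        output_string += outchar
    return output_string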

Then calling

sanitize('<3 because I am > all of you')

yields

'&lt;3 because I am &gt; all of you'
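
The same single pass can also be written with a dict lookup and str.join, which avoids building the result through repeated string concatenation. This is a sketch of an alternative, not the answer's own code; the names REPLACEMENTS and sanitize_join are hypothetical.

```python
# Hypothetical mapping of characters to their replacements.
REPLACEMENTS = {'<': '&lt;', '>': '&gt;'}


def sanitize_join(input_string):
    # Look each character up once; characters without an entry are kept as-is.
    return ''.join(REPLACEMENTS.get(ch, ch) for ch in input_string)


print(sanitize_join('<3 because I am > all of you'))
# &lt;3 because I am &gt; all of you
```

Adding another character to escape then only requires a new entry in the dict.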
FTA
  • 3
    Do have a look at string.join and list comprehensions! – Nicolas78 Aug 17 '15 at 16:21
  • 1
    Building a string with + in a loop is quadratic in the worst case because each concatenation constructs a new string. I *think* CPython can optimize this into a linear operation, but other implementations like PyPy may not be able to. – Kevin Aug 17 '15 at 16:28
  • 3
    IMPORTANT: When rolling your own sanitizer, always use an explicit list of allowed characters. If any character is NOT in the set of things you allow, either a) raise an error, b) remove it, or c) replace it with a neutral character of some kind, e.g. `elif i in set(string.ascii_letters + string.digits): ...` – Erik Aronesty Mar 08 '18 at 15:03
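
The allow-list idea from the last comment could look roughly like the sketch below, which takes option c) and replaces anything not explicitly allowed with a neutral character. The allowed set (ASCII letters, digits and the space character), the underscore replacement and the name sanitize_allowlist are all assumptions chosen for illustration.

```python
import string

# Assumed allow list: ASCII letters, digits and the space character.
ALLOWED = set(string.ascii_letters + string.digits + ' ')


def sanitize_allowlist(text, replacement='_'):
    # Keep allowed characters; swap everything else for the neutral replacement.
    return ''.join(ch if ch in ALLOWED else replacement for ch in text)


print(sanitize_allowlist('<3 because I am > all of you'))
# _3 because I am _ all of you
```

Passing replacement='' gives option b), removing disallowed characters instead of replacing them.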