Python Input Sanitization

Question

I need to do some very quick-n-dirty input sanitizing and I would like to basically convert all <, > to &lt;, &gt;.

I'd like to achieve the same results as '<script></script>'.replace('<', '&lt;').replace('>', '&gt;') without having to iterate over the string multiple times. I know about maketrans in conjunction with str.translate (i.e. http://www.tutorialspoint.com/python/string_translate.htm), but this only converts from one char to another char. In other words, one cannot do something like:

inList = '<>'
outList = ['&lt;', '&gt;']
transform = maketrans(inList, outList)

Is there a builtin function that can do this conversion in a single iteration?

I'd like to use builtin capabilities as opposed to external modules. I already know about Bleach.

In that case it seems you actually want to particularly encode characters in HTML, please check http://stackoverflow.com/questions/701704/convert-html-entities-to-unicode-and-vice-versa — Nicolas78, Aug 17 '15 at 16:11
See https://stackoverflow.com/questions/6116978/python-replace-multiple-strings for multiple string replacement in general. — augurar, Aug 17 '15 at 16:16

score 14 · Accepted Answer · answered Oct 30 '19 at 15:59

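Use html.escape(). cgi.escape() has been deprecated since Python 3.2 and was removed in Python 3.8. Note that html.escape() also escapes double and single quotes by default (quote=True).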

import html
raw = '<>&'  # avoid naming this "input", which would shadow the builtin
escaped = html.escape(raw)
print(escaped)

&lt;&gt;&amp;
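
For reference, here is a small sketch of how the quote parameter of html.escape() changes the result, and how html.unescape() reverses it. The sample string and variable names are made up for illustration.

```python
import html

# Hypothetical input containing both kinds of quotes
sample = '<a href="x">Tom & Jerry\'s</a>'

# Default (quote=True): &, <, >, " and ' are all escaped
escaped_all = html.escape(sample)

# quote=False: only &, < and > are escaped
escaped_basic = html.escape(sample, quote=False)

print(escaped_all)
print(escaped_basic)

# html.unescape() converts the entities back to the original characters
assert html.unescape(escaped_all) == sample
```

The default is the safer choice when the value may end up inside an HTML attribute; quote=False is enough for plain text nodes.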

answered Oct 30 '19 at 15:59 by Michael Dubin

score 12 · Answer 2 · answered Aug 17 '15 at 16:14

You can use cgi.escape() (Python 2; the function was removed in Python 3.8):

import cgi
inlist = '<>'
transform = cgi.escape(inlist)
print(transform)

Output:

&lt;&gt;

https://docs.python.org/2/library/cgi.html#cgi.escape

cgi.escape(s[, quote]): Convert the characters '&', '<' and '>' in string s to HTML-safe sequences. Use this if you need to display text that might contain such characters in HTML. If the optional flag quote is true, the quotation mark character (") is also translated; this helps for inclusion in an HTML attribute value delimited by double quotes, as in <a href="...">. Note that single quotes are never translated.

As mentioned in other comments, this method is deprecated since Python 3.2 (https://docs.python.org/3.7/library/cgi.html#cgi.escape). It suggests using html.escape. — Cheche, Nov 27 '19 at 12:01

score 3 · Answer 3 · answered Aug 17 '15 at 16:14


You can define your own function that loops over the string once and replaces any characters you define.

def sanitize(input_string):
    output_string = ''
    for i in input_string:
        if i == '>':
            outchar = '&gt;'
        elif i == '<':
            outchar = '&lt;'
        else:
            outchar = i
        output_string += outchar
    return output_string

Then calling

sanitize('<3 because I am > all of you')

yields

'&lt;3 because I am &gt; all of you'
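
As a possible alternative to the hand-written loop, here are two sketches that also make a single pass over the string. The first relies on the fact that, in Python 3, str.maketrans() accepts a dict whose values may be strings of any length; the second builds the result with str.join instead of repeated +=. The function names are hypothetical.

```python
# Python 3: a dict passed to str.maketrans() may map one character
# to a multi-character replacement string.
TABLE = str.maketrans({'<': '&lt;', '>': '&gt;'})


def sanitize_translate(text):
    """Replace < and > in one pass using str.translate."""
    return text.translate(TABLE)


def sanitize_join(text):
    """Same result, collecting the pieces with str.join instead of +=."""
    mapping = {'<': '&lt;', '>': '&gt;'}
    return ''.join(mapping.get(ch, ch) for ch in text)


print(sanitize_translate('<3 because I am > all of you'))
print(sanitize_join('<3 because I am > all of you'))
```

Neither version escapes '&' or quotes, so for real HTML output html.escape() is still the safer choice.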

answered Aug 17 '15 at 16:14 by FTA

Do have a look at string.join and list comprehensions! – Nicolas78 Aug 17 '15 at 16:21
Using + with strings is quadratic because it constructs a new string every time. I *think* CPython can optimize this into a linear operation, but other implementations like PyPy may not be able to. – Kevin Aug 17 '15 at 16:28
IMPORTANT: When rolling your own sanitizer, always use an explicit list. If a character is NOT in the set of things you allow, either a) raise an error, b) remove it, or c) replace it with a neutral character of some kind, e.g. `elif i in set(string.ascii_letters + string.digits): ...` – Erik Aronesty Mar 08 '18 at 15:03
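
The explicit-list advice could be sketched roughly as follows. The allowed set (ASCII letters, digits and the space character), the '_' replacement and the function names are all assumptions chosen for illustration; a real application would pick its own.

```python
import string

# Hypothetical allowlist: ASCII letters, digits and the space character
ALLOWED = set(string.ascii_letters + string.digits + ' ')


def sanitize_allowlist(text, replacement='_'):
    """Replace every character outside ALLOWED with a neutral one."""
    return ''.join(ch if ch in ALLOWED else replacement for ch in text)


def validate_allowlist(text):
    """Raise ValueError if text contains any character outside ALLOWED."""
    bad = sorted(set(text) - ALLOWED)
    if bad:
        raise ValueError(f'disallowed characters: {bad!r}')
    return text


print(sanitize_allowlist('<script>alert(1)</script>'))
```

Passing replacement='' removes the disallowed characters instead of substituting them.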
