
I'm playing with a simple script to escape certain HTML characters, and am encountering a bug which seems to be caused by the order of elements in my list escape_pairs. I'm not modifying the lists during a loop, so I can't think of any Python/programming principles I'm overlooking here.

escape_pairs = [(">", "&gt;"),("<","&lt;"),('"',"&quot;"),("&","&amp;")]

def escape_html(s):
    for (i,o) in escape_pairs:
        s = s.replace(i,o)
    return s

print escape_html(">")
print escape_html("<")
print escape_html('"')
print escape_html("&")

returns

&amp;gt;
&amp;lt;
&amp;quot;
&amp;

However, when I switch the order of the elements in my escape_pairs list to the following, the bug disappears:

>>> escape_pairsMod = [("&","&amp;"),("<","&lt;"),('"',"&quot;"),(">", "&gt;")]

&gt;
&lt;
&quot;
&amp;
dylankb
  • Yes. If you write out the value of `s` on a piece of paper and follow the steps your program is taking, you will see it happen. – jtbandes Aug 24 '15 at 05:16
  • In your first "buggy" case, how is `replace` supposed to know that you only want to replace *some* of the `&` characters and not all of them? – DSM Aug 24 '15 at 05:19

2 Answers


Yes, in your first implementation, the order of the list does affect the result.

Let's take the case of > and the list:

escape_pairs = [(">", "&gt;"),("<","&lt;"),('"',"&quot;"),("&","&amp;")]

When iterating through escape_pairs, you first get > and replace it with &gt;, so the string becomes '&gt;'. The loop then continues, and at the end it reaches ("&","&amp;") and replaces the & in the string with &amp;, which produces '&amp;gt;', the result you are seeing.

When you change the order of the list, you get the correct result, but only because & is now handled first, before any other replacement has introduced new & characters into the string.

You can use str.translate instead, which translates the string in a single pass according to a dictionary, so the order of the pairs no longer matters. Note that str.maketrans with a dictionary is Python 3 only. Example:
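To see the intermediate values described above, here is a small sketch that records s after every replacement. The helper name trace_escape and the variable buggy_pairs are hypothetical and not part of the original script.

```python
# Hypothetical helper: returns the value of s before the loop
# and after each replacement, so the double escaping is visible.
def trace_escape(s, pairs):
    steps = [s]
    for old, new in pairs:
        s = s.replace(old, new)
        steps.append(s)
    return steps

# Same order as the list in the question ("&" is handled last).
buggy_pairs = [(">", "&gt;"), ("<", "&lt;"), ('"', "&quot;"), ("&", "&amp;")]

for step in trace_escape(">", buggy_pairs):
    print(step)
```

The last recorded value is &amp;gt;, because the & that the first replacement introduced is itself escaped by the final pair.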

>>> escape_pairs = [(">", "&gt;"),("<","&lt;"),('"',"&quot;"),("&","&amp;")]
>>> escape_dict = dict(escape_pairs)
>>> t = str.maketrans(escape_dict)
>>> ">".translate(t)
'&gt;'
>>> "> & <".translate(t)
'&gt; &amp; &lt;'
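Since str.maketrans is only available in Python 3, one possible alternative for Python 2.7 is a single-pass substitution with the re module. This is only a sketch: the names pattern and escape_html_once are made up for illustration, and the approach is a swapped-in technique rather than anything str.translate does internally.

```python
import re

escape_dict = {">": "&gt;", "<": "&lt;", '"': "&quot;", "&": "&amp;"}

# One pattern that matches any of the characters to escape.
pattern = re.compile("|".join(re.escape(key) for key in escape_dict))

def escape_html_once(s):
    # Each match is looked up in the dictionary exactly once,
    # so replacement text is never scanned again.
    return pattern.sub(lambda match: escape_dict[match.group(0)], s)

print(escape_html_once("> & <"))
```

Because the string is scanned only once, the order of the entries in escape_dict does not affect the result.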

But if what you want to do is HTML-escape the string, you should use the standard library. On Python 2 that is cgi.escape (it is deprecated since Python 3.2 and was removed in Python 3.8):

>>> import cgi
>>> cgi.escape("< > &")
'&lt; &gt; &amp;'

If you are using Python 3.2+, use html.escape instead. Example:

>>> import html
>>> html.escape("< > &")
'&lt; &gt; &amp;'
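One detail that matters for the '"' case in the question: html.escape also escapes quote characters by default, controlled by its quote parameter. A short sketch, with variable names chosen only for illustration:

```python
import html

# Default (quote=True): double quotes are escaped as well.
with_quotes = html.escape('<a href="x">')

# quote=False leaves quote characters untouched.
without_quotes = html.escape('<a href="x">', quote=False)

print(with_quotes)
print(without_quotes)
```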
Anand S Kumar
  • According to the docs, html.escape should be used instead of cgi. Otherwise a great answer – erlc Aug 24 '15 at 05:56
  • I had first recommended that, but since `html` module is only Python 3.2 + , I changed to `cgi` , added it back with note that its only Python 3.2 + . – Anand S Kumar Aug 24 '15 at 05:58
  • I'd heard of the `.translate()` function but never understood what the use case would be. Guess I have one now :) – dylankb Aug 24 '15 at 13:45
  • It is worth noting though that the use of `str.translate` in your answer only works in Python 3. – dylankb Aug 25 '15 at 06:53
  • @AnandSKumar, did you have a suggestion for how to use an alternative to `str.translate` in Python 2.7? – dylankb Sep 16 '15 at 20:12

I will use the first time you call your escape_html function as an example:
print escape_html(">")

Problem:
When you call s.replace(i,o) the first time:

s = ">"

s = s.replace(i,o)

">".replace(">", "&gt;")

s = "&gt;"

But when you get to the last replace(), s still holds the result of that earlier replacement, so:

s = "&gt;"

s = s.replace(i,o)

"&gt;".replace("&","&amp;") #replaces the "&" in `"&gt;"` with `"&amp;"` 

s = "&amp;gt;"


Why Does Order Matter?
The result depends on order because when .replace("&","&amp;") comes first, there is not yet any & in the string for it to touch:

s = ">"

s = s.replace(i,o)

">".replace("&","&amp;") #No "&"'s to replace so:

s = ">" 

Then your program goes on to work as expected.
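The same ordering argument can be checked on a longer input that mixes several special characters. This is a minimal sketch that keeps the loop and only assumes & is moved to the front; the names escape_pairs_amp_first and escape_html_ordered are hypothetical.

```python
# "&" comes first, so it is replaced before any "&" is introduced
# by the other replacements.
escape_pairs_amp_first = [("&", "&amp;"), ("<", "&lt;"), ('"', "&quot;"), (">", "&gt;")]

def escape_html_ordered(s):
    for old, new in escape_pairs_amp_first:
        s = s.replace(old, new)
    return s

print(escape_html_ordered('<a href="x">Tom & Jerry</a>'))
```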

Solution:
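Because each of your test inputs contains exactly one of the characters in the list, you can return as soon as one replacement has been made. Note that this only works for such inputs; a string that mixes several special characters still needs & to be replaced first.

def escape_html(s):
    for (i,o) in escape_pairs:
        if i in s:
            # Replace only the first matching pair, then stop
            return s.replace(i,o)
    # Nothing to escape
    return s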
ThatGuyRussell
  • I appreciate the focus on the simplest fix, even if it's not the most comprehensive. – dylankb Aug 24 '15 at 14:32
  • @insighter Thanks for the appreciation! I wanted to provide a solution that fell comfortably into what you were already doing so you could quickly understand it and move forward. If this helped you, a vote up would be nice so others know it worked! – ThatGuyRussell Aug 24 '15 at 17:21