
I want to escape the unescaped data inside an XML string, e.g. turn

string = '<tag parameter = "something">I want to escape these >, < and &</tag>'

into

'<tag parameter = "something">I want to escape these &gt;, &lt; and &amp;</tag>'
  • I can't use XML parsing libraries such as xml.dom.minidom or xml.etree, because the unescaped data makes the input malformed and the parser raises an error.
  • With a regex, I found a way to match the data substring and get its start and end positions:

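    import re
    from xml.sax import saxutils

    def escape_data(label):
        # Match the text between the first ">" and the following "</"
        exp = re.search(">.+?</", label)
        start = exp.start() + 1   # skip the leading ">"
        end = exp.end() - 2       # stop before the trailing "</"
        return label[:start] + saxutils.escape(label[start:end]) + label[end:]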
    
  • But with this re.search pattern, I can't match the exact XML format.

  • If I use re.findall, I can't get the positions of the substrings found.
  • I could always look up the position of each found substring with index, but that won't be efficient. I want a simple but efficient solution.
  • BeautifulSoup solutions are welcome, but I would prefer a more beautiful way to do it with Python's standard library.
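
On the point about re.findall not giving positions: re.finditer yields match objects, and each one exposes span() for any group, so the positions come for free. The following is only a minimal sketch under assumptions that are not in the question: the helper name escape_element_text and the pattern are hypothetical, elements are assumed to be flat (not nested), and attribute values are assumed not to contain ">".

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: group 1 is the tag name, group 2 the raw text.
# The backreference \1 requires the closing tag to match the opening one.
ELEMENT_RE = re.compile(r"<(\w+)[^>]*>(.*?)</\1>", re.DOTALL)

def escape_element_text(label):
    """Escape &, < and > in the text of each flat element, using match positions."""
    pieces = []
    last = 0
    for m in ELEMENT_RE.finditer(label):
        start, end = m.span(2)               # positions of the text only
        pieces.append(label[last:start])     # markup before the text, unchanged
        pieces.append(escape(label[start:end]))
        last = end
    pieces.append(label[last:])              # whatever follows the last match
    return "".join(pieces)
```

Each piece of the input is copied once, so there is no need to search for the substrings again with index.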
Parth

1 Answer


Perhaps you should be considering re.sub:

>>> import re
>>> from xml.sax.saxutils import escape
>>> oldString = '<tag parameter = "something">I want to escape these >, < and &</tag>'
>>> newString = re.sub(r"(<tag.*?>)(.*?)</tag>", lambda m: m.group(1) + escape(m.group(2)) + "</tag>", oldString)
>>> print(newString)
<tag parameter = "something">I want to escape these &gt;, &lt; and &amp;</tag>

Be warned that this regular expression will break if you have nested tags. See Why is it such a bad idea to parse XML with regex?
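
If the tag name is not known in advance, a backreference can make the closing tag match whatever the opening tag was. This is a rough sketch rather than a robust solution: the function name escape_tag_text and the pattern are my own assumptions, it still fails on nested tags and on attribute values containing ">", and text that is already escaped (such as &amp;) would be escaped a second time.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical pattern: group 1 is the whole opening tag, group 2 its name,
# group 3 the raw text, group 4 the closing tag (forced by \2 to share the name).
TAG_RE = re.compile(r"(<(\w+)[^>]*>)(.*?)(</\2>)", re.DOTALL)

def escape_tag_text(xml_text):
    """Escape &, < and > in the text of each flat (non-nested) element."""
    return TAG_RE.sub(
        lambda m: m.group(1) + escape(m.group(3)) + m.group(4),
        xml_text,
    )

old_string = '<tag parameter = "something">I want to escape these >, < and &</tag>'
print(escape_tag_text(old_string))
# <tag parameter = "something">I want to escape these &gt;, &lt; and &amp;</tag>
```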

icedtrees
  • It's a really efficient form of my code, but my primary concern is to match the enclosing tags, still using regex – Parth Mar 05 '14 at 20:50
  • @prth I edited the code to match tags. I'm not sure if this is exactly what you wanted. – icedtrees Mar 05 '14 at 23:21