Match "without this"

Question

I need to remove every <p></p> pair that is the only <p> in its <td>.
How can this be done?

import re
text = """
    <td><p>111</p></td>
    <td><p>111</p><p>222</p></td>
    """
text = re.sub(r'<td><p>(??no</p>inside??)</p></td>', r'<td>\1</td>', text)

How can I match only when there is no </p> inside?

You can look at BeautifulSoup as an actual (X)HTML Parser, but attempting to manipulate HTML with regex is a __bad__ idea. You're only asking for headaches. — g.d.d.c, Oct 04 '11 at 18:19
What should I use for this problem? Wouldn't DOM be overkill? — Qiao, Oct 04 '11 at 18:19
Hi, Qiao. See this post on why not parse with regex: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not, and http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns — Jonathan M, Oct 04 '11 at 18:19
About parsing html with regexes, see the accepted answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Paolo Tedesco, Oct 04 '11 at 18:20

score 1 · Accepted Answer · edited May 23 '17 at 12:12

I would use minidom. I took the following snippet from here; you should be able to adapt it to your case:

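from xml.dom import minidom

# myXmlFile is a placeholder for the path of the XML file to edit
doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        # Leave a commented-out copy of the element in its place, then remove it
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
# Write the modified document back to the same file
with open(myXmlFile, "w") as f:
    f.write(doc.toxml())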

Thanks @Ivo Bosticky
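
Adapting that idea to the markup in the question might look like the sketch below. It assumes the cells can be wrapped in a single root element so that minidom (an XML parser) accepts them, and that "only <p>" means the <td> has exactly one direct <p> child. The variable names are illustrative.

```python
from xml.dom import minidom

# minidom needs well-formed XML with one root, so the cells are wrapped in <tr> here
html = "<tr><td><p>111</p></td><td><p>111</p><p>222</p></td></tr>"
doc = minidom.parseString(html)

for td in doc.getElementsByTagName("td"):
    # Collect the <p> elements that are direct children of this cell
    paragraphs = [node for node in td.childNodes
                  if node.nodeType == node.ELEMENT_NODE and node.tagName == "p"]
    if len(paragraphs) != 1:
        continue  # zero or several <p> elements: leave the cell alone
    p = paragraphs[0]
    # Move the contents of <p> up into <td>, then drop the now-empty <p>
    while p.firstChild is not None:
        td.insertBefore(p.firstChild, p)
    td.removeChild(p)

result = doc.documentElement.toxml()
print(result)
```

Real-world HTML is often not well-formed XML, in which case minidom will refuse to parse it and an HTML parser is the safer choice.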

score 1 · Answer 2 · answered Oct 04 '11 at 19:23

While using regexps with HTML is bad, matching a string that does not contain a given pattern is an interesting question in itself.

Let's assume that we want to match a string beginning with an a and ending with a z, and capture whatever is in between only when the string bar is not found inside.

Here's my take: "a((?:(?<!ba)r|[^r])+)z"

It basically says: find a, then find either an r which is not preceded by ba, or any character other than r (repeat at least once), then find a z. So a bar cannot sneak into the capture group.

Note that this approach uses a 'negative lookbehind' pattern and only works with lookbehind patterns of fixed length (like ba).
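
To try the pattern out, here is a small sketch. The second pattern is an alternative that is not part of the approach above: it swaps the lookbehind for a negative lookahead, which is not tied to a fixed-length prefix. The helper name inner and the sample strings are made up for illustration, and the last part applies the lookahead idea to the markup from the question.

```python
import re

# Pattern from the answer: an "r" is accepted only when it is not preceded by "ba"
lookbehind = re.compile(r"a((?:(?<!ba)r|[^r])+)z")

# Assumed alternative (not from the answer): before consuming each character,
# a negative lookahead checks that "bar" does not start at that position
lookahead = re.compile(r"a((?:(?!bar).)+)z")

def inner(pattern, s):
    """Return the captured middle part, or None when the string is rejected."""
    m = pattern.fullmatch(s)
    return m.group(1) if m else None

samples = ["a-foo-z", "a-r-z", "a-bar-z"]
results = {s: (inner(lookbehind, s), inner(lookahead, s)) for s in samples}
print(results)

# The same lookahead idea applied to the markup from the question
text = """
    <td><p>111</p></td>
    <td><p>111</p><p>222</p></td>
    """
cleaned = re.sub(r"<td><p>((?:(?!</p>).)*)</p></td>", r"<td>\1</td>", text)
print(cleaned)
```

fullmatch is used so that each sample is tested as a whole. With search, the two patterns may pick different substrings of the same input.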

score 0 · Answer 3 · answered Oct 04 '11 at 19:39

I would definitely recommend using BeautifulSoup for this. It's a python HTML/XML parser.

http://www.crummy.com/software/BeautifulSoup/
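
A minimal sketch of how this could look for the markup in the question, assuming BeautifulSoup 4 (the bs4 package) with the standard library's html.parser backend. The table wrapper and variable names are illustrative, and "only <p>" is taken to mean exactly one direct <p> child.

```python
from bs4 import BeautifulSoup  # assumption: BeautifulSoup 4 is installed

html = "<table><tr><td><p>111</p></td><td><p>111</p><p>222</p></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

for td in soup.find_all("td"):
    # Look only at <p> tags that are direct children of this cell
    paragraphs = td.find_all("p", recursive=False)
    if len(paragraphs) == 1:
        # unwrap() removes the tag itself but keeps its contents in place
        paragraphs[0].unwrap()

result = str(soup)
print(result)
```

Cells with several paragraphs are left untouched, so only the lone wrapper <p> disappears.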

varunl

score 0 · Answer 4 · answered Oct 05 '11 at 05:55

Not quite sure why you want to remove the P tags which don't have closing tags. However, if this is an attempt to clean up the code, an advantage of BeautifulSoup is that it can clean HTML for you:

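from bs4 import BeautifulSoup  # BeautifulSoup 4 (pip install beautifulsoup4)

html = """
<td><p>111</td>
<td><p>111<p>222</p></td>
"""
# "html.parser" is the stdlib-backed parser; lxml or html5lib may repair the markup differently
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())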

This doesn't get rid of your unmatched tags, but it adds the missing closing ones.
