1

I need to remove each <p></p> pair that is the only <p> inside its <td>.
How can this be done?

import re
text = """
    <td><p>111</p></td>
    <td><p>111</p><p>222</p></td>
    """
text = re.sub(r'<td><p>(??no</p>inside??)</p></td>', r'<td>\1</td>', text)

How can I match the content only when there is no </p> inside it?

Qiao
  • Don't parse HTML with regex. Please... – Blender Oct 04 '11 at 18:17
  • You can look at BeautifulSoup as an actual (X)HTML Parser, but attempting to manipulate HTML with regex is a __bad__ idea. You're only asking for headaches. – g.d.d.c Oct 04 '11 at 18:19
  • What should I use for this problem? Wouldn't DOM be overkill? – Qiao Oct 04 '11 at 18:19
  • Hi, Qiao. See this post on why not parse with regex: http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not, and http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns – Jonathan M Oct 04 '11 at 18:19
  • About parsing html with regexes, see the accepted answer to this question: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – Paolo Tedesco Oct 04 '11 at 18:20
  • @Qiao: Why? Any amount of tag soup can make a DOM. – BoltClock Oct 04 '11 at 18:20

4 Answers

1

I would use minidom. I took the following snippet from here; you should be able to adapt it to your case:

from xml.dom import minidom

doc = minidom.parse(myXmlFile)
for element in doc.getElementsByTagName('MyElementName'):
    if element.getAttribute('name') in ['AttrName1', 'AttrName2']:
        parentNode = element.parentNode
        parentNode.insertBefore(doc.createComment(element.toxml()), element)
        parentNode.removeChild(element)
# Write the modified document back; the with-block closes the file for us.
with open(myXmlFile, "w") as f:
    f.write(doc.toxml())

Thanks @Ivo Bosticky
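
Adapted to the question, the same minidom idea might look roughly like the sketch below. It rests on several assumptions that are not in the snippet above: the helper name `unwrap_lone_p` is made up, "the only <p>" is read as "exactly one direct <p> child of the <td>", and the input must be well-formed XML, because minidom is an XML parser and will raise on real-world tag soup.

```python
from xml.dom import minidom

def unwrap_lone_p(fragment):
    """Unwrap a <p> when it is the only direct <p> child of its <td>."""
    # minidom needs a single root element, so wrap the fragment in a dummy one.
    doc = minidom.parseString("<root>" + fragment + "</root>")
    for td in doc.getElementsByTagName("td"):
        p_children = [
            node for node in td.childNodes
            if node.nodeType == node.ELEMENT_NODE and node.tagName == "p"
        ]
        if len(p_children) != 1:
            continue  # zero or several <p> elements: leave this cell alone
        p = p_children[0]
        # Move the contents of <p> up into <td>, then drop the empty <p>.
        while p.firstChild is not None:
            td.insertBefore(p.firstChild, p)
        td.removeChild(p)
    # Serialize everything except the dummy root.
    return "".join(node.toxml() for node in doc.documentElement.childNodes)

print(unwrap_lone_p("<td><p>111</p></td><td><p>111</p><p>222</p></td>"))
```

For input that is not well-formed, a forgiving HTML parser is the safer choice.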

Matt Williamson
1

While using regexps with HTML is bad, matching a string that does not contain a given pattern is an interesting question in itself.

Let's assume that we want to match a string that begins with an a and ends with a z, and capture whatever is in between, but only when the string bar does not occur inside.

Here's my take: "a((?:(?<!ba)r|[^r])+)z"

It basically says: find a, then find either an r which is not preceded by ba, or any character other than r (repeated at least once), then find a z. So a bar cannot sneak into the capture group.

Note that this approach uses a 'negative lookbehind' pattern and only works with lookbehind patterns of fixed length (like ba).
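
As a quick check of the pattern above, here is a minimal Python sketch. The helper names and sample strings are made up for illustration. The second pattern is an assumption on my part: it applies the same lookbehind trick to the </p> case from the question (a > is allowed only when it is not preceded by </p), and it is a toy that only covers the exact <td><p>...</p></td> layout, not general HTML.

```python
import re

# Pattern from the answer: an "r" is allowed only when not preceded by "ba".
NO_BAR = re.compile(r"a((?:(?<!ba)r|[^r])+)z")

def inner_without_bar(s):
    """Return the text between a and z, or None if the pattern does not match."""
    m = NO_BAR.fullmatch(s)
    return m.group(1) if m else None

# Same idea for the question (hypothetical adaptation):
# a ">" is allowed only when it does not complete "</p>".
LONE_P = re.compile(r"<td><p>((?:(?<!</p)>|[^>])+)</p></td>")

def strip_lone_p_regex(html):
    """Replace <td><p>X</p></td> with <td>X</td> when X contains no </p>."""
    return LONE_P.sub(r"<td>\1</td>", html)

print(inner_without_bar("a--foo--z"))   # --foo--
print(inner_without_bar("a--bar--z"))   # None
print(strip_lone_p_regex("<td><p>111</p></td>"))            # <td>111</td>
print(strip_lone_p_regex("<td><p>111</p><p>222</p></td>"))  # unchanged
```

Any attribute, whitespace between the tags, or different casing defeats the second pattern, which is one more reason to prefer a parser for real documents.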

9000
0

I would definitely recommend using BeautifulSoup for this. It's a Python HTML/XML parser.

http://www.crummy.com/software/BeautifulSoup/

varunl
0

Not quite sure why you want to remove the P tags which don't have closing tags. However, if this is an attempt to clean up the markup, an advantage of BeautifulSoup is that it can clean HTML for you:

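# BeautifulSoup 4 (pip install beautifulsoup4), Python 3 syntax
from bs4 import BeautifulSoup

html = """
<td><p>111</td>
<td><p>111<p>222</p></td>
"""
# Name the parser explicitly; "html.parser" ships with the standard library.
soup = BeautifulSoup(html, "html.parser")
print(soup.prettify())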

This doesn't get rid of your unmatched tags, but it does add the missing closing tags.

Tim Richardson