3

I have an HTML file and I want to replace the empty paragraphs with a space.

mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , "&nbsp;")

This is not working.

topless

6 Answers

10

Please don't try to parse HTML with regular expressions. Use a proper parsing module, like htmlparser or BeautifulSoup, to achieve this. "Suffer" a short learning curve now and benefit:

  1. Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
  2. For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.

You won't be sorry! Profit guaranteed!
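To make the advice concrete, here is a minimal sketch using only the standard library's `html.parser`. The class name `EmptyParagraphReplacer` and the helper `replace_empty_paragraphs` are made up for this illustration, and the approach (buffer each `<p>` until it proves non-empty) is just one way to do it. Comments, doctype declarations and processing instructions are dropped by this sketch, so treat it as a starting point rather than a finished tool.

```python
from html.parser import HTMLParser


class EmptyParagraphReplacer(HTMLParser):
    """Re-emit HTML as written, but swap empty <p> elements for a replacement."""

    def __init__(self, replacement="&nbsp;"):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.replacement = replacement
        self.out = []        # finished output fragments
        self.pending = None  # fragments of a <p> that is still empty so far

    def _emit(self, text):
        (self.out if self.pending is None else self.pending).append(text)

    def _flush(self):
        # The buffered <p> turned out to have content: keep it verbatim.
        if self.pending is not None:
            self.out.extend(self.pending)
            self.pending = None

    def handle_starttag(self, tag, attrs):
        self._flush()
        if tag == "p":
            self.pending = []
        self._emit(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <p/> counts as an empty paragraph.
        self._flush()
        self._emit(self.replacement if tag == "p" else self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.pending is not None:
            # Only whitespace was seen since <p>: replace the whole element.
            self.pending = None
            self.out.append(self.replacement)
        else:
            self._flush()
            self._emit("</%s>" % tag)

    def handle_data(self, data):
        if data.strip():
            self._flush()
        self._emit(data)

    def handle_entityref(self, name):
        self._flush()
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._flush()
        self._emit("&#%s;" % name)


def replace_empty_paragraphs(html, replacement="&nbsp;"):
    parser = EmptyParagraphReplacer(replacement)
    parser.feed(html)
    parser.close()
    parser._flush()  # keep an unclosed trailing <p> as it was
    return "".join(parser.out)


mystring = "This <p></p><p>is a test</p><p></p><p></p>"
print(replace_empty_paragraphs(mystring))
```

Because the parser lowercases tag names and parses attributes for you, variants like `<P>` or `<p align="left"></p>` go through the same code path without extra patterns.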

Eli Bendersky
  • It is existing code at the company that just hired me, and they want to alter the regex... I suspect it will fail some time soon. I'll try to move to a more robust solution. – topless Mar 23 '11 at 14:05
5

I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.

Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.

from io import StringIO

from lxml import etree

html = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)

for p in tree.xpath("//p"):
    # Skip paragraphs that contain child elements such as <b>.
    if len(p):
        continue
    t = p.text
    # Remove the paragraph if its text is missing or only whitespace.
    if not (t and t.strip()):
        p.getparent().remove(p)

print(etree.tostring(tree.getroot(), pretty_print=True, encoding="unicode"))

... which produces the output:

<html>
  <body>
    <p>This </p>
    <p>is a test</p>
    <p>
      <b>Bye.</b>
    </p>
  </body>
</html>

Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with &nbsp;. With lxml, I'm not sure of a simple way to do that, so I've asked about it in a separate question.

Mark Longair
  • ++ for a real example, but you're smearing `BeautifulSoup` for no reason. The page you link to explicitly says it's an old version and historic, and the module no longer has these problems – Eli Bendersky Mar 23 '11 at 16:25
  • @Eli Bendersky: thanks for correcting that misunderstanding, I've removed that bit from my answer now. I didn't realize that situation had changed, and hadn't re-read the page I linked to - mea culpa... – Mark Longair Mar 23 '11 at 16:32
  • There is a bug in this snippet. If all the text content of p is wrapped in another element (`This will be dropped`), the above code will remove it, since `node.text` only contains the text that is present between the start tag and the start of the first child. – abhaga Jun 01 '12 at 02:19
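The point about `.text` in the comment above can be checked with a short sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in, since lxml follows the same element API; the `<b>` wrapper and the helper name `is_empty` are assumptions made for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: all of the paragraph's text sits inside a child element.
p = ET.fromstring("<p><b>This will be dropped</b></p>")

print(p.text)                 # None: there is no text before the first child
print(len(p))                 # 1: the paragraph has one child element
print("".join(p.itertext()))  # This will be dropped


def is_empty(element):
    """True if the element has no non-whitespace text anywhere inside it."""
    return not "".join(element.itertext()).strip()
```

A check built on `itertext()` looks at the text of all descendants, so it does not depend on where in the element the text happens to sit.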
2

I think a parsing module would be overkill for this particular problem.

Simply use str.replace:

>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"

>>> mystring.replace("<p></p>","&nbsp;")
'This &nbsp;<p>is a test</p>&nbsp;&nbsp;'
Xavier Combelle
2

What if <p> is entered as <P>, or < p >, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:

from pyparsing import makeHTMLTags, replaceWith, withAttribute

mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'

p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))

null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith("&nbsp;"))

print(null_paragraph.transformString(mystring))

Prints:

This &nbsp;<p>is a test</p>&nbsp;&nbsp;&nbsp;
PaulMcG
1

Using a regexp?

import re
# A raw string keeps \s intact; \s already matches newlines, so no flags are needed.
result = re.sub(r"<p>\s*</p>", "&nbsp;", mystring)

Compile the regexp if you use it often.
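A minimal sketch of the compiled form follows. The names `EMPTY_P` and `replace_empty_paragraphs` are invented for this example, and `re.IGNORECASE` is an addition that the original pattern does not have; it is included on the assumption that `<P>` should be treated like `<p>`.

```python
import re

# Compiled once at import time and reused on every call.
# IGNORECASE is an assumption of this sketch: it also matches <P> </P>.
EMPTY_P = re.compile(r"<p>\s*</p>", re.IGNORECASE)


def replace_empty_paragraphs(html, replacement="&nbsp;"):
    """Replace paragraphs that contain nothing but whitespace."""
    return EMPTY_P.sub(replacement, html)


mystring = "This <p></p><p>is a test</p><p></p><p></p>"
print(replace_empty_paragraphs(mystring))
```

This still only matches a bare `<p>` with no attributes, which is the kind of limitation the parser-based answers are warning about.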

Yannick Loiseau
0

I wrote this code:

from io import BytesIO

from lxml import etree

html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li>     </li> <p> </p></ul> <div> </div></div>"""

# iterparse reads bytes, so encode the string before wrapping it in a file object.
document = etree.iterparse(BytesIO(html_tags.encode("utf-8")), html=True)

for event, element in document:
    # Drop elements that have no children and no non-whitespace text.
    if not (element.text and element.text.strip()) and len(element) == 0:
        element.getparent().remove(element)

print(etree.tostring(document.root, encoding="unicode"))
swietyy
  • Why the down vote? It's a viable solution. It's noteworthy, though, that lxml returns a valid HTML string, so the input string gets wrapped in `<html>` and `<body>` tags. Output of the given example string is therefore: "<html><body><p>This </p><p>is a test</p></body></html>" – Simon Steinberger Jun 14 '13 at 21:53