How can I remove <p></p> with Python sub

Question

I have an HTML file and I want to replace the empty paragraphs with a space.

mystring = "This <p></p><p>is a test</p><p></p><p></p>"
result = mystring.sub("<p></p>" , "&nbsp;")

This is not working.

... keeping in mind that regex is bound to fail on someone's malformed html, like '
This
is a
test
p>' — Hugh Bothwell, Mar 23 '11 at 14:28

score 10 · Accepted Answer · edited May 23 '17 at 12:17

Please don't try to parse HTML with regular expressions. Use a proper parsing module, such as HTMLParser (html.parser in Python 3) or BeautifulSoup, to achieve this. "Suffer" a short learning curve now and benefit:

Your parsing code will be more robust, handling corner cases you may not have considered that will fail with a regex
For future HTML parsing/munging tasks, you will be empowered to do things faster, so eventually the time investment pays off as well.

You won't be sorry! Profit guaranteed!
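
As a rough illustration of the parser-based route, here is a minimal sketch that uses only the standard library's html.parser. The class name EmptyParagraphReplacer and the helper replace_empty_paragraphs are hypothetical names made up for this example, and the sketch assumes that "empty" means a <p> containing nothing but whitespace. It re-emits the input as it parses, so it is not a full HTML serializer.

```python
from html.parser import HTMLParser


class EmptyParagraphReplacer(HTMLParser):
    """Echo the input, replacing whitespace-only <p>...</p> pairs."""

    def __init__(self, placeholder="&nbsp;"):
        # convert_charrefs=False keeps entities such as &amp; as written.
        super().__init__(convert_charrefs=False)
        self.placeholder = placeholder
        self.out = []        # output fragments emitted so far
        self.pending = None  # buffered "<p>" plus whitespace, or None

    def _flush(self):
        # The buffered <p> turned out not to be empty: emit it unchanged.
        if self.pending is not None:
            self.out.extend(self.pending)
            self.pending = None

    def _emit(self, raw):
        self._flush()
        self.out.append(raw)

    def handle_starttag(self, tag, attrs):
        self._flush()
        if tag == "p":
            self.pending = [self.get_starttag_text()]
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <p/> counts as an empty paragraph.
        self._emit(self.placeholder if tag == "p" else self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p" and self.pending is not None:
            self.pending = None          # discard the empty pair
            self.out.append(self.placeholder)
        else:
            self._emit("</%s>" % tag)    # note: the tag name is lowercased

    def handle_data(self, data):
        if self.pending is not None and not data.strip():
            self.pending.append(data)    # whitespace inside a candidate <p>
        else:
            self._emit(data)

    def handle_entityref(self, name):
        self._emit("&%s;" % name)

    def handle_charref(self, name):
        self._emit("&#%s;" % name)

    def handle_comment(self, data):
        self._emit("<!--%s-->" % data)

    def handle_decl(self, decl):
        self._emit("<!%s>" % decl)


def replace_empty_paragraphs(html, placeholder="&nbsp;"):
    parser = EmptyParagraphReplacer(placeholder)
    parser.feed(html)
    parser.close()
    parser._flush()  # an unclosed trailing <p> is emitted as-is
    return "".join(parser.out)


mystring = "This <p></p><p>is a test</p><p></p><p></p>"
print(replace_empty_paragraphs(mystring))
# This &nbsp;<p>is a test</p>&nbsp;&nbsp;
```

Because the tag name is matched after parsing, variants such as <P> </P> or a <p> with attributes are handled without changing a pattern.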

answered Mar 23 '11 at 13:56

Eli Bendersky

It is existing code at the company that hired me, and they want to alter the regex... I suspect it will fail some time soon. I'll try to move to a more robust solution. – topless Mar 23 '11 at 14:05

score 5 · Answer 2 · edited May 23 '17 at 12:31

I think it's always nice to give an example of how to do this with a real parser, as well as just repeating the sound advice that Eli Bendersky gives in his answer.

Here's an example of how to remove empty <p> elements using lxml. lxml's HTMLParser deals with HTML very well.

from io import StringIO

from lxml import etree

html = '''This <p> </p><p>is a test</p><p></p><p><b>Bye.</b></p>'''

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)

for p in tree.xpath("//p"):
    # Keep any paragraph that has child elements.
    if len(p):
        continue
    # Remove paragraphs whose text is missing or only whitespace.
    t = p.text
    if not (t and t.strip()):
        p.getparent().remove(p)

print(etree.tostring(tree.getroot(), pretty_print=True, encoding="unicode"))

... which produces the output:

<html>
  <body>
    <p>This </p>
    <p>is a test</p>
    <p>
      <b>Bye.</b>
    </p>
  </body>
</html>

Note that I misread the question when replying to this, and I'm only removing the empty <p> elements, not replacing them with &nbsp;. With lxml, I'm not sure of a simple way to do this, so I've created another question to ask:

How can one replace an element with text in lxml?
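
One possible approach to that replacement step, sketched here with the standard library's xml.etree.ElementTree as a stand-in for lxml. The helper name replace_with_text is hypothetical, the <div> wrapper is added only so the fragment has a single root, and the sketch handles direct children only. With lxml, p.getparent() could supply the parent instead. The replacement text is U+00A0, because a literal "&nbsp;" assigned as text would be escaped on output.

```python
import xml.etree.ElementTree as ET


def replace_with_text(parent, child, text):
    """Replace `child` of `parent` with `text`, preserving child's tail."""
    text += child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        # No earlier sibling: the text joins the parent's leading text.
        parent.text = (parent.text or "") + text
    else:
        # Otherwise it joins the tail of the previous sibling.
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + text
    parent.remove(child)


root = ET.fromstring("<div>This <p></p><p>is a test</p><p></p><p></p></div>")
for p in list(root):  # iterate over a copy, since children are removed
    if p.tag == "p" and len(p) == 0 and not (p.text or "").strip():
        replace_with_text(root, p, "\u00a0")  # U+00A0 is what &nbsp; denotes

result = ET.tostring(root, encoding="unicode")
print(result)
```

The key point is that element trees store text in the text and tail attributes, so "replacing an element with text" means appending to whichever of those precedes the element.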

answered Mar 23 '11 at 14:10

Mark Longair

++ for a real example, but you're smearing `BeautifulSoup` for no reason. The page you link to explicitly says it's an old, historic version, and the module no longer has these problems – Eli Bendersky Mar 23 '11 at 16:25
@Eli Bendersky: thanks for correcting that misunderstanding, I've removed that bit from my answer now. I didn't realize that situation had changed, and hadn't re-read the page I linked to - mea culpa... – Mark Longair Mar 23 '11 at 16:32
There is a bug in this snippet. If all the text content of p is wrapped in another element (for example, a p whose only child element holds the text "This will be dropped"), the above code will remove it, since `node.text` only contains the text between the start tag and the start of the first child. – abhaga Jun 01 '12 at 02:19
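
The behaviour of the text attribute described in that comment can be checked with a small sketch. It uses the standard library's xml.etree.ElementTree, which shares the text/tail model with lxml. The <b> wrapper and the helper name is_blank are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring("<p><b>This will be dropped</b></p>")

# .text covers only the text before the first child, so it is None here.
print(p.text)
# itertext() walks the whole subtree and finds the nested text.
print("".join(p.itertext()))


def is_blank(element):
    """True if the element's subtree contains no non-whitespace text."""
    return not "".join(element.itertext()).strip()


print(is_blank(p))                                   # False
print(is_blank(ET.fromstring("<p> <b> </b></p>")))   # True
```

A check based on itertext() treats a paragraph as empty only when none of its descendants carry visible text.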

score 2 · Answer 3 · answered Mar 23 '11 at 14:03

I think a parsing module would be overkill for this particular problem.

Simply use str.replace:

>>> mystring = "This <p></p><p>is a test</p><p></p><p></p>"

>>> mystring.replace("<p></p>","&nbsp;")
'This &nbsp;<p>is a test</p>&nbsp;&nbsp;'

Xavier Combelle

score 2 · Answer 4 · answered Mar 23 '11 at 15:56

What if <p></p> is entered as <P></P>, or <P> </p>, or has an attribute added, or is given using the empty tag syntax <P/>? Pyparsing's HTML tag support handles all of these variations:

from pyparsing import makeHTMLTags, replaceWith, withAttribute

mystring = 'This <p></p><p>is a test</p><p align="left"></p><P> </p><P/>'

p,pEnd = makeHTMLTags("P")
emptyP = p.copy().setParseAction(withAttribute(empty=True))

null_paragraph = emptyP | p+pEnd
null_paragraph.setParseAction(replaceWith("&nbsp;"))

print(null_paragraph.transformString(mystring))

Prints:

This &nbsp;<p>is a test</p>&nbsp;&nbsp;&nbsp;

score 1 · Answer 5 · answered Mar 23 '11 at 13:59

Using a regexp?

import re
result = re.sub(r"<p>\s*</p>", "&nbsp;", mystring)

Compile the regexp if you use it often.
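
A minimal sketch of the compiled form, assuming the same pattern as above. The names EMPTY_P and replace_empty_p are hypothetical.

```python
import re

# Compile once at import time and reuse for every call.
EMPTY_P = re.compile(r"<p>\s*</p>")


def replace_empty_p(html, placeholder="&nbsp;"):
    """Replace each <p>...</p> pair holding only whitespace."""
    return EMPTY_P.sub(placeholder, html)


mystring = "This <p></p><p>is a test</p><p></p><p></p>"
print(replace_empty_p(mystring))
# This &nbsp;<p>is a test</p>&nbsp;&nbsp;
```

As written, the pattern matches only a lowercase <p> without attributes, which is the limitation the parser-based answers are pointing at.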

Yannick Loiseau

score 0 · Answer 6 · answered Apr 12 '12 at 10:53

I wrote this code:

from io import BytesIO

from lxml import etree

html_tags = """<div><ul><li>PID temperature controller</li> <li>Smart and reliable</li> <li>Auto-diagnosing</li> <li>Auto setting</li> <li>Intelligent control</li> <li>2-Rows 4-Digits LED display</li> <li>Widely applied in the display and control of the parameter of temperature, pressure, flow, and liquid level</li> <li>     </li> <p> </p></ul> <div> </div></div>"""

# iterparse reads bytes from a file-like object, so wrap the encoded markup.
document = etree.iterparse(BytesIO(html_tags.encode("utf-8")), html=True)

for a, e in document:
    # Drop elements that have no children and no non-whitespace text.
    if not (e.text and e.text.strip()) and len(e) == 0:
        e.getparent().remove(e)

print(etree.tostring(document.root, encoding="unicode"))

Why the down vote? It's a viable solution. It's noteworthy, though, that lxml returns a valid HTML string. So the input string will get wrapped in <html> and <body> tags. Output of the given example string is therefore: "<html><body><p>This </p><p>is a test</p></body></html>" – Simon Steinberger, Jun 14 '13 at 21:53
