I'm trying to manipulate an HTML file and remove a div with a certain id attribute, using Python 3.

Is there a more elegant way to manipulate or remove this container than a mix of for loops and regex?

I know there is the HTMLParser module, but I'm not sure it will help me: it finds the corresponding tags, but how do I remove them and their contents?

2 Answers

Try lxml with CSS or XPath queries.

For example, with this HTML:

<html>
  <body>
    <p>Some text in a p.</p>
    <div class="go-away">Some text in a div.</div>
    <div><p>Some text in a p in a div</p></div>
  </body>
</html>

You can read that in, remove the div with class "go-away", and output the result with:

import lxml.html

# html_txt is a str holding the HTML shown above
html = lxml.html.fromstring(html_txt)
go_away = html.cssselect('.go-away')[0]  # needs the cssselect package; a suitable XPath works too
go_away.getparent().remove(go_away)

lxml.html.tostring(html)  # returns bytes; use lxml.html.tostring(html).decode("utf-8") to get a str
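
Since the question asks about a div identified by its id rather than its class, here is a minimal sketch of the same idea using an XPath lookup on the id. The id value "sidebar" and the sample markup are assumptions made up for illustration. It also uses drop_tree() from lxml.html, which keeps any text that follows the removed element, whereas getparent().remove() discards that trailing text along with the element.

```python
import lxml.html

# Hypothetical sample document; "sidebar" is an assumed id, not taken from the question.
html_txt = """
<html>
  <body>
    <p>Some text in a p.</p>
    <div id="sidebar">Some text in a div.</div> trailing text
    <div><p>Some text in a p in a div</p></div>
  </body>
</html>
"""

doc = lxml.html.fromstring(html_txt)

# xpath() returns a list, which is simply empty when nothing matches,
# so a missing id does not raise an IndexError here.
for div in doc.xpath('//div[@id="sidebar"]'):
    # drop_tree() removes the element and its children but keeps the
    # text that follows it (its "tail").
    div.drop_tree()

# encoding="unicode" makes tostring() return a str instead of bytes.
result = lxml.html.tostring(doc, encoding="unicode")
print(result)
```

Looping over the XPath result instead of indexing with [0] is a design choice: the script then does nothing when the div is absent rather than failing.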
Antillean

While I can't stress this enough:

DON'T PARSE HTML WITH REGEX!!

Here's how I'd do it with regex:

from re import sub

# Only matches a single-line, non-nested div whose sole attribute is class="go-away"
new_html = sub('<div class=(\'go-away\'|"go-away")>.*?</div>', '', html)

Even though I think that should be OK for this input, you should avoid using regex to parse HTML. More often than not it creates odd, hard-to-debug issues, and it'll leave you with more work than you started with. Don't parse with regex.
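
To make that kind of issue concrete, here is a small sketch that runs the same pattern on two inputs it does not handle. Both input strings are made-up examples, not from the question: one has a div nested inside the target, the other spreads the target over several lines.

```python
from re import sub

# Same pattern as in the answer above.
pattern = '<div class=(\'go-away\'|"go-away")>.*?</div>'

# Hypothetical input 1: the target div contains another div.
nested = '<div class="go-away"><div>inner</div> outer</div><p>keep</p>'

# The non-greedy .*? stops at the first </div>, so the rest of the
# target div is left behind and the markup ends up unbalanced.
leftover = sub(pattern, '', nested)
print(repr(leftover))

# Hypothetical input 2: the target div spans several lines.
multiline = '<div class="go-away">\ntext\n</div>'

# Without the re.DOTALL flag, "." does not match newlines, so the
# pattern never matches and nothing is removed.
unchanged = sub(pattern, '', multiline)
print(unchanged == multiline)
```

Neither case raises an error, which is part of what makes these problems hard to spot.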

OnGle
  • Yes, indeed. Don't do it. [Or you may go mad.](http://stackoverflow.com/a/1732454/675568) – Tom Zych Dec 15 '15 at 00:12
  • Thanks for sharing your experience! And this is the exact reason why I try to avoid doing it with regex. It simply sucks because no matter what you do, things break so much quicker with regex. – Felix Suchert Dec 15 '15 at 23:23
  • There are times for regex and there are times when it's not appropriate. It's absolutely brilliant for high volumes of predictable data. :) – OnGle Dec 15 '15 at 23:40