Manipulating HTML-Code in Python3

Question

I'm trying to manipulate an HTML file and remove a div with a certain id, using Python 3.

Is there a more elegant way to manipulate or remove this container than a mix of for loops and regex?

I know there is the HTMLParser module, but I'm not sure it will help me: it finds the corresponding tags, but how do I remove them and their contents?

score 2 · Accepted Answer · answered Dec 14 '15 at 23:44

Try lxml with CSS or XPath queries.

For example, with this html:

<html>
  <body>
    <p>Some text in a p.</p>
    <div class="go-away">Some text in a div.</div>
    <div><p>Some text in a p in a div</p></div>
  </body>
</html>

You can read that in, remove the div with class "go-away", and output the result with:

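import lxml.html

# html_txt holds the HTML shown above as a string
html = lxml.html.fromstring(html_txt)
go_away = html.cssselect('.go-away')[0]  # needs the cssselect package; a suitable xpath query works too
go_away.getparent().remove(go_away)  # detach the div, and everything inside it, from its parent

lxml.html.tostring(html)  # returns bytes; use lxml.html.tostring(html).decode("utf-8") to get a string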
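
Since the question asks about an id rather than a class, here is a minimal sketch of the same idea using an XPath query on the id attribute. The id value "go-away" and the sample markup are assumptions for illustration. It uses drop_tree() rather than getparent().remove(), so any text that directly follows the removed div is kept.

```python
import lxml.html

# Sample document; the id "go-away" is a made-up value for illustration.
html_txt = """
<html>
  <body>
    <p>Some text in a p.</p>
    <div id="go-away">Some text in a div.</div>
    <div><p>Some text in a p in a div</p></div>
  </body>
</html>
"""

root = lxml.html.fromstring(html_txt)

# xpath() returns a list, so looping covers both "no match" and "several matches".
for node in root.xpath('//div[@id="go-away"]'):
    node.drop_tree()  # removes the element and its children, keeps its tail text

# encoding="unicode" makes tostring() return str instead of bytes.
result = lxml.html.tostring(root, encoding="unicode")
print(result)
```

With cssselect, the equivalent selector for an id would be '#go-away'.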

Thanks, **lxml** was exactly what I was looking for! – Felix Suchert Dec 16 '15 at 17:58

score -1 · Answer 2 · answered Dec 15 '15 at 00:02

While I can't stress this enough

DON'T PARSE HTML WITH REGEX!!

here's how I'd do it with regex.

from re import sub
# html holds the page source as a string
new_html = sub('<div class=(\'go-away\'|"go-away")>.*?</div>', '', html)

Even though I think that should be OK, you should never ever use regex to parse anything. More often than not it creates odd, hard-to-debug issues. It'll create more work for you than you started with. Don't parse with regex.
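
To make the "odd, hard-to-debug issues" concrete, here is a small sketch that runs the pattern above against two inputs it does not handle. The sample strings are made up for illustration.

```python
import re

# The pattern from the answer above, unchanged.
pattern = '<div class=(\'go-away\'|"go-away")>.*?</div>'

# Case 1: a nested div. The lazy .*? stops at the first </div>,
# so part of the outer div is left behind.
nested = '<div class="go-away"><div>inner</div> outer</div><p>keep</p>'
nested_result = re.sub(pattern, '', nested)
print(repr(nested_result))  # ' outer</div><p>keep</p>'

# Case 2: content spanning several lines. Without re.DOTALL, '.' does not
# match a newline, so nothing is removed at all.
multiline = '<div class="go-away">\nSome text\n</div><p>keep</p>'
multiline_result = re.sub(pattern, '', multiline)
print(multiline_result == multiline)  # True
```

Neither case raises an error; the output is just silently wrong, which is what makes this approach hard to debug.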

OnGle


Yes, indeed. Don't do it. [Or you may go mad.](http://stackoverflow.com/a/1732454/675568) – Tom Zych Dec 15 '15 at 00:12
Thanks for sharing your experience! And this is the exact reason why I try to avoid doing it with regex. It simply sucks because no matter what you do, things break so much quicker with regex. – Felix Suchert Dec 15 '15 at 23:23
There are times for regex and there are times when it's not appropriate, it's absolutely brilliant for high volumes of predictable data. :) – OnGle Dec 15 '15 at 23:40
