How do I use Python 2.6 to remove everything in <div class="comment"> ....remove all ....</div>, including the div tags themselves?

I tried various ways using re.sub, without any success.

Thank you

Carson Myers
Michelle Jun Lee
  • obligatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Carson Myers Apr 15 '10 at 23:57

6 Answers

This can be done easily and reliably using an HTML parser like BeautifulSoup:

>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<body><div>1</div><div class="comment"><strong>2</strong></div></body>')
>>> for div in soup.findAll('div', 'comment'):
...   div.extract()
... 
<div class="comment"><strong>2</strong></div>
>>> soup
<body><div>1</div></body>

See this question for examples on why parsing HTML using regular expressions is a bad idea.

Ayman Hourieh

With lxml.html:

from lxml import html
doc = html.fromstring(input)
for el in doc.cssselect('div.comment'):
    el.drop_tree()
result = html.tostring(doc)
Ian Bicking

You cannot properly parse HTML with regular expressions. Use an HTML parser such as lxml or BeautifulSoup.
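If installing a third-party parser is not an option, the same idea can be sketched with the standard library alone. This is a minimal sketch, not a replacement for lxml or BeautifulSoup: it assumes Python 3's `html.parser` module (on Python 2.6 the module is named `HTMLParser`), the names `DivStripper` and `remove_comment_divs` are made up for illustration, and HTML comments and doctype declarations are simply dropped from the output.

```python
from html.parser import HTMLParser


class DivStripper(HTMLParser):
    """Re-emit HTML while skipping <div class="comment"> elements (hypothetical helper)."""

    def __init__(self, css_class="comment"):
        # convert_charrefs=False keeps entities such as &amp; untouched
        HTMLParser.__init__(self, convert_charrefs=False)
        self.css_class = css_class
        self.out = []
        self.depth = 0  # > 0 while inside a div that is being removed

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested div inside the removed one
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and self.css_class in classes:
            self.depth = 1  # start skipping at the matching div
            return
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # self-closing tags such as <br/> never change the div depth
        if not self.depth:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.depth:
            if tag == "div":
                self.depth -= 1  # reaching 0 closes the removed div
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.depth:
            self.out.append("&#%s;" % name)


def remove_comment_divs(markup):
    """Return markup with every div of class "comment" removed, tags included."""
    parser = DivStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Because the parser counts nested div tags, `remove_comment_divs('<div class="comment">x<div>inner</div>tail</div><p>b</p>')` removes the whole outer div rather than stopping at the first `</div>`.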

Ignacio Vazquez-Abrams

For the record, it is usually a bad idea to process XML with regular expressions. Nevertheless:

>>> import re
>>> re.sub('>[^<]*', '>', '<div class="comment"> .. any… </div>')
'<div class="comment"></div>'
  • I wonder if the OP wishes to also remove the bookend items of the DIV tag itself in addition to the contents. – Jarret Hardie Apr 16 '10 at 00:06
  • Yes, basically I am trying to remove everything from the start to the end of the div. As another example, say you want to remove a certain table within the HTML contents, such as remove all ... – Michelle Jun Lee Apr 16 '10 at 00:11
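If the goal is to drop the tags as well as the contents, a regex along the following lines is one possibility, but only when comment divs never contain other divs. This is a rough sketch rather than a recommendation; the names `COMMENT_DIV` and `strip_comment_divs` are made up for illustration, and the pattern assumes the opening tag is written exactly as `<div class="comment">`.

```python
import re

# DOTALL lets "." match newlines, so a div spanning several lines is covered
COMMENT_DIV = re.compile(r'<div class="comment">.*?</div>', re.DOTALL)


def strip_comment_divs(markup):
    """Remove each comment div, tags included (breaks on nested divs)."""
    return COMMENT_DIV.sub("", markup)


flat = 'adsfh asdf <div class="comment"> ....remove\nall ....</div>s sdfds'
nested = '<div class="comment">x<div>inner</div>tail</div><p>b</p>'

flat_result = strip_comment_divs(flat)      # the whole div is removed
nested_result = strip_comment_divs(nested)  # stops at the first </div>
```

On the nested input the non-greedy match ends at the inner `</div>`, so `tail</div>` is left behind. That failure is the reason the other answers recommend a real parser.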

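A non-regex way:

# htmlstring is assumed to hold the HTML text (e.g. the contents of "file" below)
pat = '<div class="comment">'
# note: splitting on "</div>" drops every closing div tag, not only the comment's
for chunk in htmlstring.split("</div>"):
    m = chunk.find(pat)
    if m != -1:
        # keep only the text before the comment div opens
        chunk = chunk[:m]
    print(chunk)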

output

$ cat file
one two <tag> ....</tag>
 adsfh asdf <div class="comment"> ....remove
all ....</div>s sdfds
<div class="blah" .......
.....
blah </div>

$ ./python.py
one two <tag> ....</tag>
 adsfh asdf
s sdfds
<div class="blah" .......
.....
blah
ghostdog74

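Use BeautifulSoup to find all of those elements, then remove each one from the tree:

from BeautifulSoup import BeautifulSoup

tomatosoup = BeautifulSoup(myhtml)

# findAll (capital A) is the BeautifulSoup 3 method name
tomatochunks = tomatosoup.findAll("div", {"class": "comment"})

for chunk in tomatochunks:
    # remove the div and everything inside it from the tree
    chunk.extract()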
JiminyCricket