python remove everything between

Asked Apr 15 '10 at 23:50

Active Apr 16 '10 at 02:56

Viewed 6,844 times

10

How do you use Python 2.6 to remove a div and everything inside it, including the tags themselves: `<div class="comment"> ....remove all ....</div>`?

I tried various ways using re.sub without any success.

Thank you

python class html

edited Apr 15 '10 at 23:54
Carson Myers


asked Apr 15 '10 at 23:50
Michelle Jun Lee


obligatory link: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Carson Myers Apr 15 '10 at 23:57

6 Answers

18

This can be done easily and reliably using an HTML parser like BeautifulSoup:

```python
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup('<body><div>1</div><div class="comment"><strong>2</strong></div></body>')
>>> for div in soup.findAll('div', 'comment'):
...     div.extract()
...
<div class="comment"><strong>2</strong></div>
>>> soup
<body><div>1</div></body>
```

See this question for examples on why parsing HTML using regular expressions is a bad idea.
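If installing BeautifulSoup is not an option, the same idea (walk the parsed markup and skip the unwanted subtree) can be sketched with only the standard library. This is a rough Python 3 sketch, not code from any of the answers: `ElementRemover` and `remove_elements` are hypothetical names, and it assumes reasonably well-formed markup in which every matched element is closed.

```python
from html.parser import HTMLParser


class ElementRemover(HTMLParser):
    """Re-emit HTML while skipping every `tag` element (optionally by class)."""

    def __init__(self, tag, css_class=None):
        # Keep entities such as &amp; untouched so they can be re-emitted as-is.
        super().__init__(convert_charrefs=False)
        self.tag = tag.lower()
        self.css_class = css_class
        self.out = []
        self.skip_depth = 0  # > 0 while inside an element being removed

    def _matches(self, tag, attrs):
        if tag != self.tag:
            return False
        if self.css_class is None:
            return True
        classes = (dict(attrs).get("class") or "").split()
        return self.css_class in classes

    def handle_starttag(self, tag, attrs):
        if self.skip_depth:
            # Count nested elements of the same name so the right end tag closes the skip.
            if tag == self.tag:
                self.skip_depth += 1
        elif self._matches(tag, attrs):
            self.skip_depth = 1
        else:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> have no separate end tag.
        if not self.skip_depth and not self._matches(tag, attrs):
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if self.skip_depth:
            if tag == self.tag:
                self.skip_depth -= 1
        else:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(data)

    def handle_entityref(self, name):
        if not self.skip_depth:
            self.out.append("&%s;" % name)

    def handle_charref(self, name):
        if not self.skip_depth:
            self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        if not self.skip_depth:
            self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        if not self.skip_depth:
            self.out.append("<!%s>" % decl)


def remove_elements(markup, tag, css_class=None):
    """Return `markup` without the matching elements and their contents."""
    parser = ElementRemover(tag, css_class)
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


page = '<body><div>1</div><div class="comment"><strong>2</strong></div></body>'
print(remove_elements(page, "div", "comment"))  # <body><div>1</div></body>
```

Leaving `css_class` as `None` removes every element with that tag name, so `remove_elements(page, "table")` would drop all tables.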

edited May 23 '17 at 10:30
Community


answered Apr 16 '10 at 00:26
Ayman Hourieh


3

With lxml.html:

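```python
from lxml import html

doc = html.fromstring(input)  # `input` holds the HTML string
for el in doc.cssselect('div.comment'):
    # drop_tree() removes the element and its children but keeps its tail text
    el.drop_tree()
result = html.tostring(doc)
```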

answered Apr 16 '10 at 02:56
Ian Bicking


2

You cannot properly parse HTML with regular expressions. Use an HTML parser such as lxml or BeautifulSoup.

answered Apr 15 '10 at 23:56
Ignacio Vazquez-Abrams


I am trying to remove everything, including the div and everything in between. I can't seem to find any reference to that in BeautifulSoup – Michelle Jun Lee Apr 16 '10 at 00:08

Another example: say I want to remove ..., so I am trying to remove all the tables in the HTML content. I am not sure how you do that in BeautifulSoup – Michelle Jun Lee Apr 16 '10 at 00:10

Not even in the "Removing elements" subsection of the "Modifying the Parse Tree" section of the documentation? – Ignacio Vazquez-Abrams Apr 16 '10 at 00:10

Yes, I saw that, but you can't remove a specific class or id related to that tag – Michelle Jun Lee Apr 16 '10 at 00:13

0

For the record, it is usually a bad idea to process XML with regular expressions. Nevertheless:

```python
>>> import re
>>> re.sub('>[^<]*', '>', '<div class="comment> .. any… </div>')
'<div class="comment></div>'
```
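To make the warning concrete, here is a small Python 3 sketch of a regex that removes the tags as well as the contents, and of where it goes wrong. The pattern and the sample strings are made up for illustration; they are not taken from the question.

```python
import re

# Non-greedy match from the opening comment div to the FIRST closing </div>.
COMMENT_DIV = re.compile(r'<div class="comment">.*?</div>', re.DOTALL)

flat = '<p>keep</p><div class="comment">drop <b>me</b></div><p>keep too</p>'
nested = '<div class="comment">a<div>b</div>c</div><p>keep</p>'

flat_result = COMMENT_DIV.sub('', flat)
nested_result = COMMENT_DIV.sub('', nested)

print(flat_result)    # <p>keep</p><p>keep too</p>
print(nested_result)  # c</div><p>keep</p>
```

The flat case works, but as soon as the comment div contains another div, the match stops at the inner `</div>` and leaves stray text and an unbalanced closing tag behind. A greedy `.*` fails the other way by swallowing everything up to the last `</div>` in the document.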

answered Apr 15 '10 at 23:58
David Schein


I wonder if the OP wishes to also remove the bookend items of the DIV tag itself in addition to the contents. – Jarret Hardie Apr 16 '10 at 00:06

Yes, basically I am trying to remove everything from the start to the end of the div. Another example: say you want to remove a certain table within the HTML content, such as remove all ... – Michelle Jun Lee Apr 16 '10 at 00:11

0

A non-regex way:

```python
pat = '<div class="comment">'
for chunks in htmlstring.split("</div>"):
    m = chunks.find(pat)
    if m != -1:
        chunks = chunks[:m]
    print chunks
```

Output:

```
$ cat file
one two <tag> ....</tag>
 adsfh asdf <div class="comment"> ....remove
all ....</div>s sdfds
<div class="blah" .......
.....
blah </div>

$ ./python.py
one two <tag> ....</tag>
 adsfh asdf
s sdfds
<div class="blah" .......
.....
blah
```
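Note that the output above also loses the `</div>` of the `blah` div, because `split("</div>")` consumes every closing tag. A possible Python 3 variation that puts the closing tag back for chunks that were not truncated might look like this; `strip_comment_divs` is a hypothetical name, and the approach still breaks when divs are nested inside the comment div.

```python
def strip_comment_divs(htmlstring, pat='<div class="comment">'):
    """Drop each comment div by cutting chunks at `pat`; keep other </div> tags."""
    parts = htmlstring.split("</div>")
    kept = []
    for i, chunk in enumerate(parts):
        is_last = i == len(parts) - 1
        m = chunk.find(pat)
        if m != -1:
            # Cut from the opening tag onwards; the consumed </div> belonged to it.
            kept.append(chunk[:m])
        elif is_last:
            # Text after the final </div> had no closing tag to restore.
            kept.append(chunk)
        else:
            # This chunk ended with a </div> of some other div, so restore it.
            kept.append(chunk + "</div>")
    return "".join(kept)


sample = 'a <div class="comment">x</div> b <div class="blah">y</div> c'
print(strip_comment_divs(sample))  # a  b <div class="blah">y</div> c
```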

answered Apr 16 '10 at 00:07
ghostdog74


0

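Use Beautiful Soup and do something like this to get all of those elements, then remove each one:

```python
from BeautifulSoup import BeautifulSoup

tomatosoup = BeautifulSoup(myhtml)
# The method is spelled findAll (capital A) in BeautifulSoup 3
tomatochunks = tomatosoup.findAll("div", {"class": "comment"})
for chunk in tomatochunks:
    chunk.extract()  # remove the div and everything inside it
```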

answered Apr 16 '10 at 00:43
JiminyCricket


Also, if it's XML and not HTML, use BeautifulStoneSoup: http://www.crummy.com/software/BeautifulSoup/documentation.html – JiminyCricket Apr 16 '10 at 00:43
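For input that really is well-formed XML, the standard library's `xml.etree.ElementTree` could do the same removal without BeautifulStoneSoup. This is a minimal Python 3 sketch that swaps in ElementTree, not something from the linked documentation. `remove_xml_elements` is a hypothetical helper, and it cannot remove the root element itself.

```python
import xml.etree.ElementTree as ET


def remove_xml_elements(xml_text, tag, css_class):
    """Remove every <tag class="... css_class ..."> element below the root."""
    root = ET.fromstring(xml_text)
    # Snapshot the elements first so the tree can be modified while looping.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == tag and css_class in child.get("class", "").split():
                # Element.remove() also discards child.tail, so move that
                # text onto the previous sibling or the parent first.
                if child.tail:
                    if previous is not None:
                        previous.tail = (previous.tail or "") + child.tail
                    else:
                        parent.text = (parent.text or "") + child.tail
                parent.remove(child)
            else:
                previous = child
    return ET.tostring(root, encoding="unicode")


doc = '<root><p>a</p><div class="comment">x</div>t<p>b</p></root>'
print(remove_xml_elements(doc, "div", "comment"))  # <root><p>a</p>t<p>b</p></root>
```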

