How can I strip comment tags from HTML using BeautifulSoup?

Question

I have been playing with BeautifulSoup, which is great. My end goal is to get just the text from a page: the text of the body, plus, as a special case, the title and/or alt attributes from <a> or <img> tags.

So far I have this (edited and updated) code:

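from BeautifulSoup import BeautifulSoup, Comment

soup = BeautifulSoup(page)
# Find every comment node and remove it from the tree
comments = soup.findAll(text=lambda text: isinstance(text, Comment))
[comment.extract() for comment in comments]
# Join the remaining text nodes and collapse runs of whitespace
page = ''.join(soup.findAll(text=True))
page = ' '.join(page.split())
print page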

1) What would you suggest as the best way to handle my special case, so that the title and alt attributes of the two tags listed above are NOT excluded? If this is too complex, it isn't as important as #2.

2) I would like to strip comment tags (<!-- ... -->) and everything in between them. How would I go about that?

QUESTION EDIT @jathanism: Here are some comment tags that I have tried to strip, but that remain even when I use your example:

<!-- Begin function popUp(URL) { day = new Date(); id = day.getTime(); eval("page" + id + " = window.open(URL, '" + id + "', 'toolbar=0,scrollbars=0,location=0,statusbar=0,menubar=0,resizable=0,width=300,height=330,left = 774,top = 518');"); } // End -->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //-->
<!-- var MenuBar1 = new Spry.Widget.MenuBar("MenuBar1", {imgDown:"SpryAssets/SpryMenuBarDownHover.gif", imgRight:"SpryAssets/SpryMenuBarRightHover.gif"}); //-->
<!-- var whichlink=0 var whichimage=0 var blenddelay=(ie)? document.images.slide.filters[0].duration*1000 : 0 function slideit(){ if (!document.images) return if (ie) document.images.slide.filters[0].apply() document.images.slide.src=imageholder[whichimage].src if (ie) document.images.slide.filters[0].play() whichlink=whichimage whichimage=(whichimage<slideimages.length-1)? whichimage+1 : 0 setTimeout("slideit()",slidespeed+blenddelay) } slideit() //-->

Is there a source document you're using as a test case? It would be really helpful if you could provide something you have in mind as a basis for comparison. — jathanism, Aug 17 '10 at 22:01

score 64 · Answer 1 · answered Aug 17 '10 at 22:07

Straight from the documentation for BeautifulSoup, you can easily strip comments (or anything) using extract():

from BeautifulSoup import BeautifulSoup, Comment
soup = BeautifulSoup("""1<!--The loneliest number-->
                        <a>2<!--Can be as bad as one--><b>3""")
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
[comment.extract() for comment in comments]
print soup
# 1
# <a>2<b>3</b></a>
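
The snippet above targets BeautifulSoup 3 on Python 2. As a rough sketch of the same idea on Python 3 with the bs4 package (an assumption, since the answer itself only covers BS3), the helper below does the same extraction. The name strip_comments is hypothetical, and the built-in "html.parser" backend is chosen only to keep the example free of extra dependencies. The string= keyword is the spelling used by recent bs4 releases; older ones use text=.

```python
from bs4 import BeautifulSoup, Comment


def strip_comments(html):
    """Return the markup with every HTML comment removed (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns a list, so removing nodes while looping over it is safe.
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return str(soup)


cleaned = strip_comments(
    "1<!--The loneliest number--><a>2<!--Can be as bad as one--><b>3</b></a>"
)
print(cleaned)
```

A plain for loop is used here instead of a list comprehension, so no throwaway list is built just for its side effects.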

jathanism

I don't know why I didn't see that. Thank you for waking me up! – Nathan Aug 17 '10 at 22:15
Nice. But it looks all icky to do a list comprehension with side-effects :p. How about `map( lambda x: x.extract(), comments )`? – Katriel Aug 17 '10 at 22:23
I am still trying to figure out why it doesn't find and strip tags like this `` Those backslashes cause certain tags to be overlooked – Nathan Aug 17 '10 at 22:26
Has something changed in BeautifulSoup? I tried with 3.2.0 and it has no problem with comments like ``. – Kiran Jonnalagadda Mar 09 '11 at 09:42

score 3 · Accepted Answer · answered Aug 17 '10 at 23:06

"I am still trying to figure out why it doesn't find and strip tags like this: . Those backslashes cause certain tags to be overlooked."

This may be a problem with the underlying SGML parser: see http://www.crummy.com/software/BeautifulSoup/documentation.html#Sanitizing%20Bad%20Data%20with%20Regexps. You can override it by using a markupMassage regex -- straight from the docs:

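import re, copy
from BeautifulSoup import BeautifulSoup

# Example input: the comment opens with '<!-' instead of '<!--'
badString = "Foo<!-This comment is malformed.-->Bar<br/>Baz"

# Rewrite '<!-x' as '<!--x' before the markup reaches the parser
myMassage = [(re.compile('<!-([^-])'), lambda match: '<!--' + match.group(1))]
myNewMassage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
myNewMassage.extend(myMassage)

BeautifulSoup(badString, markupMassage=myNewMassage)
# Foo<!--This comment is malformed.-->Bar<br />Baz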
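
The markupMassage argument is specific to BeautifulSoup 3. Where it is not available, one possible workaround is to apply the same regex to the raw string before handing it to whatever parser is in use. The sketch below uses only the standard library; the function name fix_comment_openers is hypothetical, and the sample string is just an illustration of a malformed opener.

```python
import re

# Same pattern as the massage rule: "<!-" followed by a character that is not "-".
MALFORMED_COMMENT_OPEN = re.compile(r"<!-([^-])")


def fix_comment_openers(markup):
    """Rewrite '<!-x' as '<!--x' so the comment opener becomes well formed."""
    return MALFORMED_COMMENT_OPEN.sub(lambda m: "<!--" + m.group(1), markup)


fixed = fix_comment_openers("Foo<!-This comment is malformed.-->Bar<br/>Baz")
print(fixed)
# Foo<!--This comment is malformed.-->Bar<br/>Baz
```

Well-formed comments are left alone, because the pattern only matches when the character after "<!-" is not a second dash.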

Katriel

This is a tough one and this looks to be a good workaround. Sad that it still ends up using regex to parse HTML. Stupid regex! – jathanism Aug 17 '10 at 23:59
OK I will work on the re.compile to detect the messed up comments I listed. Need to brush up on my regex's though. blech. – Nathan Aug 18 '10 at 00:21
@jathanism -- BeautifulSoup uses several regexes internally to polish the HTML before it feeds it to `sgmllib`. It's not pretty, but it's not Lovecraftian either. – Katriel Aug 18 '10 at 08:37
Just to update this old post, the BeautifulSoup.MARKUP_MASSAGE has been deprecated. "The BeautifulSoup constructor no longer recognizes the markupMassage argument. It’s now the parser’s responsibility to handle markup correctly." http://www.crummy.com/software/BeautifulSoup/bs4/doc/ (At the very bottom of the page) – Timber Oct 30 '14 at 13:19

score 0 · Answer 3 · answered Apr 16 '19 at 09:55

If you are looking for a solution in BeautifulSoup version 3, see the BS3 docs on Comment:

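import re
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup("""Hello! <!--I've got to be nice to get what I want.-->""")
# Locate the comment by a word it contains, then take its class (BeautifulSoup.Comment)
comment = soup.find(text=re.compile("nice"))
Comment = comment.__class__
# Remove every node of that class from the tree
for element in soup(text=lambda text: isinstance(text, Comment)):
    element.extract()
print soup.prettify()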

answered Apr 16 '19 at 09:55 · edited Jun 21 '19 at 12:30 · Vanjith

score 0 · Answer 4 · answered Jul 05 '20 at 15:54

If mutation isn't your bag, you can filter the comments out instead of extracting them:

[t for t in soup.find_all(text=True) if not isinstance(t, Comment)]
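
A slightly fuller sketch of this non-mutating approach, assuming Python 3 with the bs4 package and the built-in "html.parser" backend: the helper name visible_text is hypothetical, and the whitespace cleanup simply mirrors the join/split step from the question. The string= keyword is the spelling used by recent bs4 releases; older ones use text=, as in the line above.

```python
from bs4 import BeautifulSoup, Comment


def visible_text(html):
    """Return the document text without comments, leaving the tree untouched."""
    soup = BeautifulSoup(html, "html.parser")
    # Comment is a subclass of NavigableString, so find_all(string=True)
    # returns comments too; filter them out rather than extracting them.
    strings = [s for s in soup.find_all(string=True) if not isinstance(s, Comment)]
    # Join the pieces and collapse runs of whitespace, as in the question.
    return " ".join("".join(strings).split())


text = visible_text("<p>Hello <!-- hidden note --> world</p>")
print(text)
# Hello world
```

Because nothing is extracted, the soup object can still be used afterwards with its comments intact.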

Joffer
