How to remove all content between two HTML comments using BeautifulSoup

Question

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<div><span id="company">Apple</span> Chats:</div>
<div>abcdefg<span>xvfdadsad</span>sdfsdfsdf</div>
<div>
<li>(<span>7</span>sadsafasf<span>vdvdsfdsfds</span></li>
<li>(<span>8</span>) <span>Reim</span></li>
</div>
<!-- Ad -->
<a href="#">

I want to remove everything between the two comment lines using bs4, so that the file ends up like this:

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<!-- Ad -->
<a href="#">

Have you tried writing code yet? [Extracting Text Between HTML Comments with BeautifulSoup](https://stackoverflow.com/questions/34673851/extracting-text-between-html-comments-with-beautifulsoup) and [this answer showing `.extract()`](https://stackoverflow.com/a/5598678/6243352) and finally [writing your soup to file](https://stackoverflow.com/a/40530238/6243352) might help you get started. Once you have code, feel free to [edit](https://stackoverflow.com/posts/67629271/edit) to show where you're stuck. Thanks. — ggorlen, May 20 '21 at 23:58
Is that the entire HTML? or there's more? If there is, please add a bit more of the HTML — MendelG, May 21 '21 at 00:05
@ggorlen Yes, I have tried the code in the first question you posted before I post my question. But that code is not returning anything for me. — Lykosz, May 21 '21 at 00:18
@MendelG It's a large page of HTML, which I don't know if I am allowed to post. But this is the only part I want to make change. — Lykosz, May 21 '21 at 00:19

ggorlen · Accepted Answer · 2021-05-21T01:38:41.337

First of all, be careful with snippets of HTML taken out of context. If you print your soupified snippet, you'll get:

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<html>
 <body>
  <div>
   <span id="company">
   ...

Whoops: BeautifulSoup placed the comment above the <html> tag. That is clearly not what you intended, because an algorithm that removes the elements between the two comments would then remove the entire document (which is why including your code is important).

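As for the main task, element.decompose() and element.extract() both remove an element from the tree. The subtle difference is that extract() returns the removed element, while decompose() destroys it. Elements found during a walk should be collected in a separate list and removed only after the traversal ends, because changing the tree mid-walk disrupts the iteration.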
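To make the difference between the two calls concrete, here is a minimal sketch. The two-paragraph HTML string is made up for illustration, and "html.parser" is used only so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup, just to show the two removal calls side by side.
soup = BeautifulSoup("<div><p>one</p><p>two</p></div>", "html.parser")
first, second = soup.find_all("p")

removed = first.extract()  # detached from the tree and returned, still usable
second.decompose()         # detached and destroyed, returns None

print(removed)  # <p>one</p>
print(soup)     # <div></div>
```

Both paragraphs are gone from the tree, but only the extracted one can still be inspected or re-inserted elsewhere.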

from bs4 import BeautifulSoup, Comment

html = """
<body>
<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<div><span id="company">Apple</span> Chats:</div>
<div>abcdefg<span>xvfdadsad</span>sdfsdfsdf</div>
<div>
<li>(<span>7</span>sadsafasf<span>vdvdsfdsfds</span></li>
<li>(<span>8</span>) <span>Reim</span></li>
</div>
<!-- Ad -->
<a href="#">
"""
start_comment = " Top Plans & Programs: Most Common User Phrases - List Bucket 6 "
end_comment = " Ad "
soup = BeautifulSoup(html, "lxml")
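to_extract = []
between_comments = False

# Walk every node in document order (.descendants replaces the
# deprecated recursiveChildGenerator() alias).
for x in soup.descendants:
    # Text nodes and comments are str subclasses, so only tags are collected.
    if between_comments and not isinstance(x, str):
        to_extract.append(x)

    if isinstance(x, Comment):
        if start_comment == x:
            between_comments = True
        elif end_comment == x:
            break

# Remove the collected tags only after the traversal has finished.
for x in to_extract:
    x.decompose()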

print(soup.prettify())

Output:

<html>
 <body>
  <!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
  <!-- Ad -->
  <a href="#">
  </a>
 </body>
</html>

Note that if the ending comment isn't at the same level as the starting comment, this will destroy all parent elements of the ending comment. If you don't want that, you'll need to walk back up the parent chain until you reach the level of the starting comment.
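One way to sidestep this problem is to skip the ancestors of the ending comment during the walk, rather than walking back up the parent chain afterwards. The following is only a sketch of that alternative, not the approach used above. `find_comment` and `remove_between` are hypothetical helper names, the sample markup is invented, and "html.parser" is chosen only to avoid the lxml dependency. Unlike the code above, it also removes bare text nodes between the comments.

```python
from bs4 import BeautifulSoup, Comment


def find_comment(nodes, text):
    """Return the first Comment in `nodes` whose text equals `text`, else None."""
    for node in nodes:
        if isinstance(node, Comment) and node == text:
            return node
    return None


def remove_between(soup, start_text, end_text):
    """Remove nodes between two comments, keeping the end comment's ancestors."""
    start = find_comment(soup.descendants, start_text)
    if start is None:
        return
    end = find_comment(start.next_elements, end_text)
    if end is None:
        return

    end_ancestors = list(end.parents)
    to_remove = []
    node = start.next_element
    while node is not None and node is not end:
        # Skip tags that contain the end comment; removing them would remove it too.
        if not any(node is ancestor for ancestor in end_ancestors):
            to_remove.append(node)
        node = node.next_element

    # Detach only after the walk, so the traversal is not disturbed.
    for node in to_remove:
        node.extract()


# Invented example: the end comment is nested one level deeper than the start.
html = (
    '<div id="outer"><!-- start --><p>gone</p>'
    '<section id="keep"><span>also gone</span><!-- end --><b>stays</b></section></div>'
)
soup = BeautifulSoup(html, "html.parser")
remove_between(soup, " start ", " end ")
print(soup)
```

In this example the `<section>` that holds the ending comment survives, along with the `<b>` that follows the comment, while the `<p>` and the `<span>` before it are removed.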

Another solution using .find and .next_element (same imports, HTML string and output as above):

start_comment = " Top Plans & Programs: Most Common User Phrases - List Bucket 6 "
end_comment = " Ad "
soup = BeautifulSoup(html, "lxml")
# string= is the current name of the older text= argument.
el = soup.find(string=lambda x: isinstance(x, Comment) and start_comment == x)
end = el.find_next(string=lambda x: isinstance(x, Comment) and end_comment == x)
to_extract = []

# Step through the document from the start comment to the end comment.
while el and end and el is not end:
    if not isinstance(el, str):
        to_extract.append(el)

    # .next_element is the current name of the older .next alias.
    el = el.next_element

for x in to_extract:
    x.decompose()

print(soup.prettify())
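Since the question is about changing the file itself, the modified soup still has to be written back to disk. Here is a minimal sketch of that step, assuming UTF-8 files. `rewrite_html_file` and `drop_paragraphs` are hypothetical names, the temporary file and its contents are invented for the demo, and "html.parser" is used only to keep the example free of the lxml dependency.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def rewrite_html_file(path, transform, parser="html.parser"):
    """Parse the file at `path`, apply `transform(soup)`, and write it back."""
    path = Path(path)
    soup = BeautifulSoup(path.read_text(encoding="utf-8"), parser)
    transform(soup)
    path.write_text(str(soup), encoding="utf-8")


def drop_paragraphs(soup):
    """Stand-in transform for the demo: remove every <p> tag."""
    for p in soup.find_all("p"):
        p.decompose()


# Demo on a throwaway file with invented content.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<div><p>x</p><b>y</b></div>", encoding="utf-8")
    rewrite_html_file(page, drop_paragraphs)
    result = page.read_text(encoding="utf-8")

print(result)  # <div><b>y</b></div>
```

Any function that edits the soup in place, such as the comment-based removal above wrapped in a function, could be passed in place of `drop_paragraphs`.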

This works like a charm. Very high quality and detailed answer. — Lykosz, May 24 '21 at 17:12

MendelG · Answer 2 · 2021-05-21T01:06:49.503

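You can remove the divs with the .decompose() method. The comments are of type Comment rather than Tag, so they are not treated as tags and remain in place. Walk the elements that follow the span with id="company" and decompose the tags:

from bs4 import BeautifulSoup

# `html` is assumed to hold the snippet from the question; "html.parser" is
# also an assumption, chosen because the output below has no <html>/<body> wrapper.
soup = BeautifulSoup(html, "html.parser")

# Walk all the elements after the tag with `id="company"`
for tag in soup.find("span", id="company").next_elements:
    # Stop at the first `a` tag, which comes after the closing comment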
    if tag.name == "a":
        break
    else:
        try:
            tag.previous_sibling.decompose()
        except AttributeError:
            continue

print(soup.prettify())

Output:

<!-- Top Plans & Programs: Most Common User Phrases - List Bucket 6 -->
<!-- Ad -->
<a href="#">
</a>

edited May 21 '21 at 01:06

answered May 21 '21 at 00:12

MendelG

But the problem is that I have div tags not in between those two comments. And I don't want to delete those. – Lykosz May 21 '21 at 00:17
That's why I commented on your question if that's the entire HTML. Please edit your question with more HTML – MendelG May 21 '21 at 00:18
I replied in the comment area for that request. This one is using comment: https://stackoverflow.com/questions/34673851/extracting-text-between-html-comments-with-beautifulsoup But somehow the code isn't working for me. Could you take a look at that post? – Lykosz May 21 '21 at 00:22
This assumes the tags are in the exact pattern OP showed. If anything changes, it will fail. – ggorlen May 21 '21 at 01:51
