
I have the following HTML:

<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>
        
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="French">French</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>

Each <details data-level="2" open=""> element contains an <h2> element, such as <h2 id="English">English</h2>, that names the language. My goal is to delete every <details data-level="2" open=""> whose language is not English. My expected result is:

<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>

I get this result with the following code:

from bs4 import BeautifulSoup

texte = """
<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>

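        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="French">French</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>

        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>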
    </div>
</div>
"""

soup = BeautifulSoup(texte, 'html.parser')

# Find every language heading, then remove its enclosing <details>
# section unless the heading text is 'English'.
for h2 in soup.select('details > summary > h2'):
    if h2.get_text(strip=True) != 'English':
        h2.find_parent('details').decompose()

print(soup)

I need to repeat this operation nearly 4 million times. Is there a more efficient way to do it? Thank you so much for your help!

Akira
  • Maybe try something like a Rabin-Karp string search, where you hash the strings and only compare them when the hashes match. I'm just trying to think of ways to match strings faster inside a for loop. @Andrej Kesely's BeautifulSoup approach is probably faster, though. – rocket_boomerang_19 Apr 21 '21 at 23:55

1 Answer

You can use a CSS selector with :not() and :has(), then .extract() the selected elements:

for d in soup.select('details[data-level="2"]:not(:has(h2#English))'):
    d.extract()

print(soup.prettify())

Prints:

<div class="content mw-parser-output" id="bodyContent">
 <div id="mw-content-text" style="direction: ltr;">
  <h1 aria-haspopup="true" class="section-heading" data-section-id="0" tabindex="0">
   <span class="mw-headline" id="title_0">
    pomme
   </span>
  </h1>
  <details data-level="2" open="">
   <summary class="section-heading">
    <h2 id="English">
     English
    </h2>
   </summary>
   <details data-level="3" open="">
    abc
   </details>
  </details>
 </div>
</div>
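
Since the operation has to run on many documents, the selector above could be wrapped in a small function. This is only a sketch: the name keep_only_language and its language parameter are hypothetical, it matches the heading with h2[id="..."] instead of h2#English so that ids containing unusual characters are less likely to break the selector, and it makes no claim about how much faster this is on real data.

```python
from bs4 import BeautifulSoup


def keep_only_language(html: str, language: str = "English") -> str:
    """Hypothetical helper: drop every level-2 section not in `language`."""
    soup = BeautifulSoup(html, "html.parser")
    # Same idea as the selector in the answer, with the language id
    # passed in as an attribute value.
    selector = f'details[data-level="2"]:not(:has(h2[id="{language}"]))'
    # select() returns a list, so removing elements while looping is safe.
    for section in soup.select(selector):
        section.decompose()
    return str(soup)


# Small sample in the same shape as the HTML in the question.
sample = """
<div id="mw-content-text">
  <details data-level="2" open="">
    <summary class="section-heading"><h2 id="English">English</h2></summary>
    <details data-level="3" open="">abc</details>
  </details>
  <details data-level="2" open="">
    <summary class="section-heading"><h2 id="French">French</h2></summary>
    <details data-level="3" open="">abc</details>
  </details>
  <details data-level="2" open="">
    <summary class="section-heading"><h2 id="Norman">Norman</h2></summary>
    <details data-level="3" open="">abc</details>
  </details>
</div>
"""

result = keep_only_language(sample)
print(result)
```

Here .decompose() is used instead of .extract() because the removed sections are not needed afterwards; either call removes them from the tree.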
Andrej Kesely
  • Hi Andrej, can you have a look at [this question](https://stackoverflow.com/questions/67208933/how-to-check-if-a-soup-contains-an-element) and elaborate on why `soup.find('details[data-level="2"]:has(h2#English)')` did not work? – Akira Apr 22 '21 at 08:05
  • @LEAnhDung The `.find()` method doesn't accept CSS selectors. Use `.select()` or `.select_one()` instead. – Andrej Kesely Apr 22 '21 at 08:23
  • Thank you so much Andrej. Your answer is as elegant as always :)) – Akira Apr 22 '21 at 08:24
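
The difference between `.find()` and `.select_one()` discussed in the comments can be illustrated with a short sketch. The one-section HTML string and the variable names are made up for this example; the lambda passed to `.find()` is just one possible way to express the same condition without a CSS selector.

```python
from bs4 import BeautifulSoup

html = (
    '<details data-level="2" open="">'
    '<summary><h2 id="English">English</h2></summary>'
    "</details>"
)
soup = BeautifulSoup(html, "html.parser")
selector = 'details[data-level="2"]:has(h2#English)'

# .find() treats a string argument as a tag name, so the whole selector
# is compared against tag names and nothing matches.
via_find = soup.find(selector)

# .select_one() parses the string as a CSS selector.
via_select = soup.select_one(selector)

# A rough .find() equivalent: filter with a function instead of a selector.
via_find_function = soup.find(
    lambda tag: tag.name == "details"
    and tag.get("data-level") == "2"
    and tag.find("h2", id="English") is not None
)

print(via_find)                 # no match
print(via_select.name)          # the matching <details> element
print(via_find_function.name)   # the same element, found without CSS
```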