
I have an HTML document and would like to check whether it contains at least one English section. Such a section is marked by

<summary class="section-heading"><h2 id="English">English</h2></summary>

This check runs millions of times, so for efficiency I want the search to stop as soon as it finds the first such element. I tried a method from here. Could you please explain

  • why soup.find('details[data-level="2"]:has(h2#English)') does not work, whereas soup.select_one('details[data-level="2"]:has(h2#English)') works perfectly?

  • how to solve it?


from bs4 import BeautifulSoup

texte = """
<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>     
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="French">French</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>
"""

soup = BeautifulSoup(texte, 'html.parser')

if soup.find('details[data-level="2"]:has(h2#English)'):  
    print('found')
else:
    print('not found')
Akira

3 Answers


You can use find_all to collect the level-2 details elements and then check each one for the language you want:

from bs4 import BeautifulSoup

texte = """
<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>     
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="French">French</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>
"""

soup = BeautifulSoup(texte, 'html.parser')
details = soup.find_all("details", {"data-level": "2"})
lang = "English"
for detail in details:
    detail_str = str(detail)
    if lang in detail_str:
        print(detail)

Outputs:

<details data-level="2" open="">
<summary class="section-heading"><h2 id="English">English</h2></summary>
<details data-level="3" open="">abc</details>
</details>
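
One possible refinement of the loop above, offered as a sketch rather than as part of the original answer: the substring test on str(detail) also matches a section that merely mentions the word "English" in its text, and find_all collects every level-2 section before the loop starts. Passing a function to find checks the structure instead and returns at the first match. The names has_language_section and is_match are hypothetical, and the sample HTML is shortened. Note that BeautifulSoup still parses the whole document when the soup is built, so only the search itself stops early.

```python
from bs4 import BeautifulSoup

sample = """
<details data-level="2" open="">
    <summary class="section-heading"><h2 id="English">English</h2></summary>
</details>
<details data-level="2" open="">
    <summary class="section-heading"><h2 id="French">French</h2></summary>
    <details data-level="3" open="">borrowed from English</details>
</details>
"""


def has_language_section(soup, lang):
    """Return True if a level-2 <details> contains <h2 id=lang>."""

    def is_match(tag):
        # Structural check: correct element, correct level, correct heading id.
        return (
            tag.name == "details"
            and tag.get("data-level") == "2"
            and tag.find("h2", id=lang) is not None
        )

    # find() stops at the first tag for which is_match returns True.
    return soup.find(is_match) is not None


soup = BeautifulSoup(sample, "html.parser")
print(has_language_section(soup, "English"))
print(has_language_section(soup, "German"))
```

With this sample, a plain substring test for "English" would also match the French section, because its text mentions the word.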
David Meu

BeautifulSoup does not support XPath, so lxml is an alternative:

from lxml import html
texte = """
<div id="bodyContent" class="content mw-parser-output">
    <div id="mw-content-text" style="direction: ltr;">
        <h1 class="section-heading" tabindex="0" aria-haspopup="true" data-section-id="0">
            <span class="mw-headline" id="title_0">pomme</span>
        </h1>     
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="English">English</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
        <details data-level="2" open="">
            <summary class="section-heading"><h2 id="French">French</h2></summary>
            <details data-level="3" open="">abc</details>
        </details>
    </div>
</div>
"""
tree = html.fromstring(texte)
element = tree.xpath('//details[@data-level="2"]//h2[contains(text(),"English")]')
if element:
    print("Found")
else:
    print("Not found")
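
Both BeautifulSoup and lxml build the complete tree before any search runs. If the goal is to stop as early as possible, one alternative (a different technique from the XPath query above, sketched here under assumptions) is a streaming parser from the standard library that aborts at the first matching heading. The names SectionFinder, _Found and has_section are hypothetical, and the sketch assumes reasonably well-formed markup in which every details element is closed.

```python
from html.parser import HTMLParser


class _Found(Exception):
    """Raised to abort parsing as soon as the target heading is seen."""


class SectionFinder(HTMLParser):
    def __init__(self, lang):
        super().__init__()
        self.lang = lang
        # One entry per open <details>; True when it has data-level="2".
        self.stack = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "details":
            self.stack.append(attrs.get("data-level") == "2")
        elif tag == "h2" and attrs.get("id") == self.lang and any(self.stack):
            # Heading found inside a level-2 section: stop parsing here.
            raise _Found

    def handle_endtag(self, tag):
        if tag == "details" and self.stack:
            self.stack.pop()


def has_section(html_text, lang="English"):
    """Return True if <h2 id=lang> occurs inside <details data-level="2">."""
    parser = SectionFinder(lang)
    try:
        parser.feed(html_text)
        parser.close()
    except _Found:
        return True
    return False


sample = """
<details data-level="2" open="">
    <summary class="section-heading"><h2 id="English">English</h2></summary>
    <details data-level="3" open="">abc</details>
</details>
<details data-level="2" open="">
    <summary class="section-heading"><h2 id="French">French</h2></summary>
</details>
"""

print(has_section(sample))
print(has_section(sample, "German"))
```

Whether this is faster than lxml in practice depends on the documents, so it is worth timing both on real data before choosing.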
George Imerlishvili

You can use select_one instead of find, for example:

soup.select_one('details[data-level="2"] summary.section-heading h2#English')

The result will be

<h2 id="English">English</h2>
  • As for why `soup.find('details[data-level="2"]:has(h2#English)')` does not work: find does not accept CSS selectors; it treats the string as a tag name. [More info here](https://beautiful-soup-4.readthedocs.io/en/latest/#searching-the-tree) – Antony Phoenix Apr 22 '21 at 09:12
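
The difference described in the comment above can be checked directly. The following is a minimal sketch on a shortened version of the question's HTML; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

html_doc = """
<details data-level="2" open="">
    <summary class="section-heading"><h2 id="English">English</h2></summary>
    <details data-level="3" open="">abc</details>
</details>
"""

soup = BeautifulSoup(html_doc, "html.parser")
selector = 'details[data-level="2"]:has(h2#English)'

# find() interprets the string as a tag name; no tag has this name.
by_find = soup.find(selector)

# select_one() interprets the string as a CSS selector.
by_select = soup.select_one(selector)

# A find() call that expresses the heading test with keyword filters.
by_find_kw = soup.find("h2", id="English")

print(by_find is None)
print(by_select is not None)
print(by_find_kw is not None)
```

All three lines print True: find returns None for the selector string, while select_one and the keyword form of find both locate the element.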