using bs4 to find a html tag (h2) having text

Question

For this part of the HTML code:

html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<hr/>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""

I want to use BeautifulSoup to find the h2 whose text equals "Content Logical Definition", together with its next siblings. But BeautifulSoup cannot find the h2. The following is my code:

soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings

This is the error:

AttributeError: 'NoneType' object has no attribute 'nextsibilings'

There are several h2 elements in the text, and the only thing that makes this h2 unique is the text "Content Logical Definition". After finding this h2, I want to extract data from the table and list under it.

Try `nextsiblings`??? – Scratch'N'Purr May 29 '16 at 15:11

alecxe · Accepted Answer · 2016-05-31T00:49:40.787

5

The main problem is how you locate the h2 element whose siblings you want. Instead, I'd use a function that checks whether "Content Logical Definition" is inside the element's text:

soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

Also, to get the next siblings you should use .next_siblings, not nextsibilings.
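A minimal sketch of why the original lookup comes back as None. It uses a shortened copy of the question's h2 and the built-in `html.parser` backend, both of which are assumptions made for illustration. When a tag name is combined with a string filter, BeautifulSoup compares the filter against the tag's `.string`, and `.string` is None whenever the tag has more than one child. The sketch uses `string=`, the newer spelling of the `text=` argument:

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's heading (assumed markup, for illustration only)
html = (
    '<h2><span class="sectioncount">3.342.2323</span>'
    ' Content Logical Definition <a href="#">link</a></h2><hr/>'
)
soup = BeautifulSoup(html, "html.parser")
h2 = soup.find("h2")

# The h2 has three children (span, text node, a), so .string is None
print(h2.string)

# A name + string filter is matched against .string, so nothing is found
exact = soup.find("h2", string="Content Logical Definition")
print(exact)

# A function filter can test the concatenated text of all descendants instead
by_function = soup.find(
    lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text
)
print(by_function is h2)
```

The function filter is more forgiving because `.text` joins every nested string, including the section number in the span and the surrounding whitespace.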

Demo:

>>> from bs4 import BeautifulSoup
>>> html3= """<a name="definition"> </a>
... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
... <hr/>
... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
... <p> </p>"""
>>> soup = BeautifulSoup(html3, "lxml")
>>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
>>> for sibling in h2.next_siblings:
...     print(sibling)
... 
<hr/>
<div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
<p> </p>

That said, now that I know the real HTML you are dealing with and how messy it can be, I think you should iterate over the siblings and stop at the next h2, or earlier if you find a table before that. Actual implementation:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

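for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    # match the heading by substring, since the h2 contains several child nodes
    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    # guard against pages that have no such heading
    if h2 is not None:
        for sibling in h2.find_next_siblings():
            if sibling.name == "table":
                table = sibling
                break
            # the next h2 starts a new section, so stop looking
            if sibling.name == "h2":
                break
    print(table)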
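The same idea can be tried offline. The sketch below is not part of the original answer: `find_section_table` is a hypothetical helper, the inline HTML is made up to resemble the question's structure, and `html.parser` is used only to avoid the lxml dependency. It also looks inside each sibling, on the assumption that the table may be nested (in the question's snippet it sits inside div > ul > li, so it is not a direct sibling of the h2):

```python
from bs4 import BeautifulSoup


def find_section_table(soup, heading_text):
    """Return the first table between the matching h2 and the next h2, else None.

    Hypothetical helper for illustration; not part of bs4.
    """
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.text)
    if h2 is None:
        return None
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            # the next heading starts a new section: stop without a table
            return None
        if sibling.name == "table":
            return sibling
        # the table may be nested inside the sibling (e.g. div > ul > li > table)
        nested = sibling.find("table")
        if nested is not None:
            return nested
    return None


# Made-up markup that mimics the layout described in the question
html = """
<h2>Empty Section</h2>
<p>no table here</p>
<h2><span>3.1</span> Content Logical Definition</h2>
<div><ul><li>Include these codes
<table><tr><td>34353553</td><td>Examination / signs</td></tr></table>
</li></ul></div>
<h2>Expansion</h2>
<table><tr><td>other</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

table = find_section_table(soup, "Content Logical Definition")
cells = [td.get_text() for td in table.find_all("td")]
print(cells)

# A section without a table yields None instead of a table from a later section
print(find_section_table(soup, "Empty Section"))
```

Stopping at the next h2 is what keeps a table from a later section, such as "Expansion", from being returned for a section that has none.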

edited May 31 '16 at 00:49

answered May 29 '16 at 15:15


Thank you! My main purpose is to find "table" and "ul" among the next siblings. But after this part of the code, "for sibling in h2.next_siblings:", if you write "if sibling.name=="table": print "2"" (as an example), it does not print. It seems it does not consider or – Mary May 29 '16 at 15:59
@Mary sure, I think you can just use `find_next()` this way: `table = h2.find_next("table")`. – alecxe May 29 '16 at 16:01
Thank you so much! Now the problem is that it finds any table after the h2, while I just want the table that is defined in the "Content Logical Definition" section. In other words, if no table is defined in the "Content Logical Definition" section, I want table in the code "table = h2.find_next("table")" to be empty. What do you suggest? Thanks again! – Mary May 29 '16 at 19:01
@Mary okay, to have a better context, could you provide a complete input HTML you have and a desired output? Thanks! – alecxe May 29 '16 at 19:32
@alecxe thanks! These are the links for two of the webpages I am trying to extract information from. I just need information from the "Content Logical Definition" section, so if no table is defined in this section, I am going to use "null" for all fields that would be defined in the table: https://www.hl7.org/fhir/valueset-activity-reason.html , https://www.hl7.org/fhir/valueset-account-status.html – Mary May 29 '16 at 21:23
@Mary okay, I think `h2.parent.select_one("table.codes")` should solve that. It would return `None` if there is no table defined where your `h2` label is. Let me know if it helps or not. Thanks – alecxe May 30 '16 at 18:13
Thanks alecxe! For this link: https://www.hl7.org/fhir/valueset-activity-reason.html , 'h2.parent.select_one("table.codes")' still returned the table in the "Expansion" section. – Mary May 30 '16 at 23:46
In addition, for this link: https://www.hl7.org/fhir/valueset-age-units.html , which has a table in "Content Logical Definition", it does not work: the code "h2.parent.select_one("table.codes")" returned None. I highly appreciate your time. Thanks – Mary May 30 '16 at 23:50
@Mary gotcha, interesting case. Updated the answer - hope it helps! – alecxe May 31 '16 at 00:49
@alecxe, great code! It really works for the first url: https://www.hl7.org/fhir/valueset-activity-reason.html . However, for the second url: https://www.hl7.org/fhir/valueset-age-units.html , the code captures the table defined at the end of the page, not the table in the "Content Logical Definition" section. Is it possible to edit the code so that it captures a table defined directly after the h2 with "Content Logical Definition", not at the end of the page? – Mary May 31 '16 at 16:01
@Mary okay, please consider creating a separate question about this problem if you experience difficulties - this way you can get more help. Comments are generally not the best place to solve follow-up issues. Thanks for understanding. Please throw me the link to the question here. – alecxe May 31 '16 at 16:02
sure, should I copy and paste your answer for the question and explain problem associated with each of them? – Mary May 31 '16 at 17:22
this is the link for the new question: http://stackoverflow.com/questions/37552550/access-to-a-specific-table-in-html-tag – Mary May 31 '16 at 17:57
@alecxe, would it be possible to answer the question posted here: http://stackoverflow.com/questions/38680057/beautiful-soup-just-extract-header-of-a-table – Mary Jul 31 '16 at 10:26
