
For this piece of HTML:

html3= """<a name="definition"> </a>
<h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
<hr/>
<div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
<p> </p>"""

I am trying to use BeautifulSoup to find the h2 whose text equals "Content Logical Definition", along with its next siblings, but BeautifulSoup cannot find the h2. Here is my code:

soup = BeautifulSoup(html3, "lxml")
f= soup.find("h2", text = "Content Logical Definition").nextsibilings

This is the error:

AttributeError: 'NoneType' object has no attribute 'nextsibilings'

There are several h2 elements in the document, and the only thing that makes this one unique is its text, "Content Logical Definition". After finding this h2, I want to extract data from the table and list under it.

alecxe
Mary

1 Answer


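The main problem is how you locate the h2 element. The text argument is compared against the element's .string, and .string is None when an element has more than one child, as this h2 does (a span, a text node, and a link). So find() returns None, which is where the AttributeError comes from. I'd use a function instead that checks whether Content Logical Definition is inside the text: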

soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)

Also, to get the next siblings you should use .next_siblings, not nextsibilings.

Demo:

>>> from bs4 import BeautifulSoup
>>> html3= """<a name="definition"> </a>
... <h2><span class="sectioncount">3.342.2323</span> Content Logical Definition <a title="link to here" class="self-link" href="valueset-investigation"><img src="ta.png"/></a></h2>
... <hr/>
... <div><p from the following </p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td/></tr><tr><td>35453453453</td><td>History/symptoms</td><td/></tr></table></li></ul></div>
... <p> </p>"""
>>> soup = BeautifulSoup(html3, "lxml")
>>> h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
>>> for sibling in h2.next_siblings:
...     print(sibling)
... 
<hr/>
<div><p following="" from="" the=""></p><ul><li>Include these codes as defined in http://snomed.info/sct<table><tr><td><b>Code</b></td><td><b>Display</b></td></tr><tr><td>34353553</td><td>Examination / signs</td><td></td></tr><tr><td>35453453453</td><td>History/symptoms</td><td></td></tr></table></li></ul></div>
<p> </p>
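
To see why the exact-text lookup comes back empty while the function-based one succeeds, here is a minimal sketch on a shortened version of the heading. The shortened markup, the variable names, and the use of the built-in "html.parser" instead of lxml are assumptions made for illustration; `string=` is the newer spelling of the `text=` argument.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the heading from the question.
html = (
    '<h2><span class="sectioncount">3.342.2323</span> '
    'Content Logical Definition <a href="#x">link</a></h2>'
    '<hr/><div>body</div>'
)
soup = BeautifulSoup(html, "html.parser")

h2_tag = soup.find("h2")
# The h2 has three children (span, text node, a), so .string is None.
print(h2_tag.string)

# An exact string match is tested against .string, so nothing matches.
exact = soup.find("h2", string="Content Logical Definition")
print(exact)

# A function that checks the combined text of the element does match.
by_function = soup.find(
    lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.get_text()
)
print(by_function is h2_tag)
```

The first two prints show `None`, which is the value that later triggers the AttributeError in the question.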

That said, now that I know the real HTML you are dealing with and how messy it can be, I think you should iterate over the siblings and stop either when you find a table or when you reach the next h2, whichever comes first. Actual implementation:

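import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    # Match on a substring, because the h2 also contains a span and a link.
    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    if h2 is None:
        # No such section on this page; avoid an AttributeError below.
        print(None)
        continue

    table = None
    # find_next_siblings() yields only tags, not whitespace text nodes.
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break
        # The next h2 starts a new section, so stop looking.
        if sibling.name == "h2":
            break
    print(table)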
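
The same sibling walk can be tried offline on a small inline document. The sketch below is only an illustration: the helper name `table_in_section`, the sample markup, and the extra check for a table nested inside a sibling (as in the question's `div > ul > li > table` structure) are assumptions, not part of the code above.

```python
from bs4 import BeautifulSoup


def table_in_section(soup, heading_text):
    """Return the first table between the matching h2 and the next h2, or None."""
    h2 = soup.find(lambda elm: elm.name == "h2" and heading_text in elm.get_text())
    if h2 is None:
        return None
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            # The next section starts here, so this section has no table.
            return None
        if sibling.name == "table":
            return sibling
        # Assumption: also accept a table nested inside a sibling such as a div.
        nested = sibling.find("table")
        if nested is not None:
            return nested
    return None


# Hypothetical sample: the first section has no table, the second one does.
sample = """
<h2>Content Logical Definition</h2>
<div><ul><li>Include all codes</li></ul></div>
<h2>Expansion</h2>
<div><table><tr><td>a</td></tr></table></div>
"""
soup = BeautifulSoup(sample, "html.parser")

print(table_in_section(soup, "Content Logical Definition"))
print(table_in_section(soup, "Expansion").td.get_text())
```

Stopping at the next h2 is what keeps a table from a later section from being picked up for an earlier one.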
alecxe
  • Thank you! My main goal is to find "table" and "ul" among the next siblings. But after the line "for sibling in h2.next_siblings:", if you write, as an example, "if sibling.name=="table": print "2"", nothing is printed. It seems it does not consider table or ul as next siblings. However, it works for "div".
    – Mary May 29 '16 at 15:59
  • @Mary sure, I think you can just use `find_next()` this way: `table = h2.find_next("table")`. – alecxe May 29 '16 at 16:01
  • Thank you so much! Now the problem is that it finds any table after the h2, while I only want the table defined in the "content logical definition" section. In other words, if no table is defined in the "content logical definition" section, I want table in "table = h2.find_next("table")" to be empty. What do you suggest? Thanks again! – Mary May 29 '16 at 19:01
  • @Mary okay, to have a better context, could you provide a complete input HTML you have and a desired output? Thanks! – alecxe May 29 '16 at 19:32
  • @alecxe thanks! These are the links to two of the web pages I am trying to extract information from. I only need information from the "Content logical definition" section, so if no table is defined in this section, I am going to set "null" for all the fields that would have come from the table: https://www.hl7.org/fhir/valueset-activity-reason.html , https://www.hl7.org/fhir/valueset-account-status.html – Mary May 29 '16 at 21:23
  • @Mary okay, I think `h2.parent.select_one("table.codes")` should solve that. It would return `None` if there is no table defined where your `h2` label is. Let me know if it helps or not. Thanks – alecxe May 30 '16 at 18:13
  • Thanks alecxe! For this link: https://www.hl7.org/fhir/valueset-activity-reason.html , 'h2.parent.select_one("table.codes")' still returned the table in the "expansion" section. – Mary May 30 '16 at 23:46
  • In addition, it does not work for this link: https://www.hl7.org/fhir/valueset-age-units.html , which has a table in "content logical definition". The code "h2.parent.select_one("table.codes")" returned None for this link. I highly appreciate your time. Thanks – Mary May 30 '16 at 23:50
  • @Mary gotcha, interesting case. Updated the answer - hope it helps! – alecxe May 31 '16 at 00:49
  • @alecxe, great code! It really works for the first url: https://www.hl7.org/fhir/valueset-activity-reason.html . However, for the second url: https://www.hl7.org/fhir/valueset-age-units.html , the code captures the table defined at the end of the page, not the table in the "content logical definition" section. Is it possible to edit the code so that it captures a table defined right after the h2 with "Content Logical Definition", not the one at the end of the page? – Mary May 31 '16 at 16:01
  • @Mary okay, please consider creating a separate question about this problem if you experience difficulties - this way you can get more help. Comments are generally not the best place to solve follow-up issues. Thanks for understanding. Please throw me the link to the question here. – alecxe May 31 '16 at 16:02
  • Sure. Should I copy and paste your answers into the question and explain the problem associated with each of them? – Mary May 31 '16 at 17:22
  • this is the link for the new question: http://stackoverflow.com/questions/37552550/access-to-a-specific-table-in-html-tag – Mary May 31 '16 at 17:57
  • @alecxe, would it be possible to answer the question posted here: http://stackoverflow.com/questions/38680057/beautiful-soup-just-extract-header-of-a-table – Mary Jul 31 '16 at 10:26
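
The behaviour discussed in the comments, where `find_next("table")` picks up a table from a later section, can be reproduced on a small sample. This is a rough sketch; the sample markup and variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the first section has no table, the next one does.
sample = """
<h2>Content Logical Definition</h2>
<p>No table in this section.</p>
<h2>Expansion</h2>
<table class="codes"><tr><td>code</td></tr></table>
"""
soup = BeautifulSoup(sample, "html.parser")
h2 = soup.find(
    lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.get_text()
)

# find_next() searches everything after the h2 in document order,
# so it crosses into the "Expansion" section.
found = h2.find_next("table")
print(found["class"])

# Walking the siblings and stopping at the next h2 stays inside the section.
section_tags = []
for sibling in h2.find_next_siblings():
    if sibling.name == "h2":
        break
    section_tags.append(sibling.name)
print(section_tags)
```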