
I am trying to use BeautifulSoup to find the table defined in the “Content Logical Definition” section of the following pages:

1) https://www.hl7.org/fhir/valueset-account-status.html
2) https://www.hl7.org/fhir/valueset-activity-reason.html
3) https://www.hl7.org/fhir/valueset-age-units.html 

Each page may define several tables. The table I want is located under the <h2> tag with the text “Content Logical Definition”. Some pages have no table in the “Content Logical Definition” section, and in that case I want the result to be null. So far I have tried several solutions, but each of them returns the wrong table for some of the pages.

The last solution that was offered by alecxe is this:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]

for url in urls:
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

    h2 = soup.find(lambda elm: elm.name == "h2" and "Content Logical Definition" in elm.text)
    table = None
    for sibling in h2.find_next_siblings():
        if sibling.name == "table":
            table = sibling
            break
        if sibling.name == "h2":
            break
    print(table)

This solution returns null if there is no table in the “Content Logical Definition” section, but for the second URL, which does have a table in that section, it returns the wrong table: one at the end of the page.
How can I edit this code so that it returns the table defined directly after the <h2> tag with the text “Content Logical Definition”, and returns null if there is no table in that section?

alecxe
Mary

1 Answer


It looks like the problem with alecxe's code is that it returns a table that is a direct sibling of h2, but the one you want is actually within a div (which is h2's sibling). This worked for me:

import requests
from bs4 import BeautifulSoup

urls = [
    'https://www.hl7.org/fhir/valueset-account-status.html',
    'https://www.hl7.org/fhir/valueset-activity-reason.html',
    'https://www.hl7.org/fhir/valueset-age-units.html'
]


def extract_table(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'lxml')

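    h2 = soup.find(lambda elm: elm.name == 'h2' and 'Content Logical Definition' in elm.text)
    if h2 is None:
        return None  # the page has no such section
    # The table sits inside the div that follows the heading, not next to the h2 itself
    div = h2.find_next_sibling('div')
    return div.find('table') if div is not None else None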


for url in urls:
    print(extract_table(url))
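
If it helps to check the idea without hitting the network, here is a minimal offline sketch. The HTML strings are hypothetical, simplified stand-ins for the real pages (the actual HL7 markup may differ), and `find_section_table` is an illustrative helper name, not something from the code above. It looks for a table either as a direct sibling of the heading or nested inside a sibling such as a div, and stops at the next h2 so that a table from a later section is never returned. It uses the built-in `html.parser` only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the layout described above:
# the wanted table is nested in a <div> that follows the heading.
HTML_WITH_TABLE = """
<h2>Content Logical Definition</h2>
<div><p>This value set includes codes:</p>
<table id="wanted"><tr><td>min</td></tr></table></div>
<h2>Expansion</h2>
<table id="other"><tr><td>x</td></tr></table>
"""

# Same layout, but the section of interest contains no table.
HTML_WITHOUT_TABLE = """
<h2>Content Logical Definition</h2>
<div><p>No table in this section.</p></div>
<h2>Expansion</h2>
<div><table id="other"><tr><td>x</td></tr></table></div>
"""


def find_section_table(html, heading="Content Logical Definition"):
    """Return the first table between the heading and the next <h2>, else None."""
    soup = BeautifulSoup(html, "html.parser")
    h2 = soup.find(lambda elm: elm.name == "h2" and heading in elm.get_text())
    if h2 is None:
        return None  # the page has no such heading
    for sibling in h2.find_next_siblings():
        if sibling.name == "h2":
            return None  # reached the next section without finding a table
        if sibling.name == "table":
            return sibling  # table is a direct sibling of the heading
        table = sibling.find("table")  # table nested inside a div or similar
        if table is not None:
            return table
    return None


if __name__ == "__main__":
    print(find_section_table(HTML_WITH_TABLE))
    print(find_section_table(HTML_WITHOUT_TABLE))
```

To try it on the real pages, `r.content` from `requests.get(url)` could be passed in place of the sample strings.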
Noah
  • @Noah, thank you so much! It is great code and works very well. Would you please check my other question at the following link: "http://stackoverflow.com/questions/37555709/beautiful-soup-captures-null-values-in-a-table". Thank you again! – Mary Jun 01 '16 at 00:00
  • @Padraic, thanks for your comment! So what do you suggest? – Mary Jun 03 '16 at 01:05