Extracting text from HTML is not working in Python

Question

I have this HTML:

<div style="padding-top: 10px;" id="government_funding">
    <h2>Sampling of Recent Funding Actions/Set Asides</h2>
    <p style="font-style: italic; font-size: .8em;">In order by amount of set aside monies.</p>
    <ul>
        <li><span style="color: green;">$14,450</span> - Thursday the 17th of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NICHD<br>AVANTI POLAR LIPIDS:1109394 [17-010744]
            <hr>
        </li>
        <li><span style="color: green;">$5,455</span> - Thursday the 31st of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NICHD<br>AVANTI POLAR LIPIDS:1109394 [17-004567]
            <hr>
        </li>
        <li><span style="color: green;">$5,005</span> - Tuesday the 8th of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NIAID<br>CUSTOM LIPID SYNTHESIS (24:0-10:0 PE) 100 MG PACKAGED IN 10-10MG VIALS POWDER PER QUOTE #DQ-000665
            <hr>
        </li>
        <li><span style="color: green;">$5,005</span> - Thursday the 17th of August 2017<br><span style="font-weight: bold; font-size: 1.2em;">National Institutes Of Health</span> <br> NATIONAL INSTITUTES OF HEALTH NIAID<br>CUSTOM LIPID SYNTHESIS (24:0-10:0 PE) 100 MG PACKAGED IN 10-10MG VIALS POWDER PER QUOTE #DQ-000665
            <hr>
        </li>
    </ul>
</div>

I currently use this script to retrieve the text in the span tags:

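import re

def all_data(d):
    # The two span tags in each li hold the amount and the agency name
    a, b = [i.text for i in d.find_all('span')]
    # The date ("Thursday the 17th of August 2017") is matched in the li's full text
    return [a, *re.findall(r'\w+\sthe\s\w+\sof\s\w+\s\d+', d.text), b]

fundresults = [all_data(b) for b in businessesoup.find('div', {'id': 'government_funding'}).find_all('li')]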

for fundingItem in fundresults:
    fundingPrice = fundingItem[0]
    fundingDate = fundingItem[1]
    fundingAgency = fundingItem[2]

This works, but I could not find a way to extract the last two lines of text from each li. For example, I want to extract this text from the first li:

NATIONAL INSTITUTES OF HEALTH NICHD AVANTI POLAR LIPIDS:1109394 [17-010744]

How can I extract the text that is not inside a span tag?

You may look into using [beautiful soup](https://pypi.org/project/beautifulsoup4/) for html parsing — G. Anderson, Oct 12 '18 at 21:46
Please look at this famous stackoverflow question: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Danyal, Oct 12 '18 at 21:57
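
One possible approach, as a minimal sketch rather than a definitive answer: collect only the text that sits directly inside each li and skip anything inside a span. The sketch below uses the standard library's html.parser so it runs without extra packages. The class name LiTextOutsideSpan, the helper text_outside_spans, and the shortened sample markup are hypothetical names made up for illustration. It assumes the li elements are not nested and that the last two text fragments of each li are the ones wanted, as in the HTML above.

```python
from html.parser import HTMLParser


class LiTextOutsideSpan(HTMLParser):
    """Collect, per <li>, the text fragments that are not inside a <span>."""

    def __init__(self):
        super().__init__()
        self.items = []       # one list of text fragments per <li>
        self.in_li = False    # True while the parser is inside an <li>
        self.span_depth = 0   # > 0 while the parser is inside a <span>

    def handle_starttag(self, tag, attrs):
        if tag == 'li':
            self.in_li = True
            self.items.append([])
        elif tag == 'span':
            self.span_depth += 1

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_li = False
        elif tag == 'span' and self.span_depth:
            self.span_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is inside an <li> but outside every <span>
        if self.in_li and self.span_depth == 0 and text:
            self.items[-1].append(text)


def text_outside_spans(html):
    """Return a list with one list of non-span text fragments per <li>."""
    parser = LiTextOutsideSpan()
    parser.feed(html)
    parser.close()
    return parser.items


# Shortened copy of the first <li> from the question (style attributes dropped)
sample = (
    '<ul><li><span>$14,450</span> - Thursday the 17th of August 2017<br>'
    '<span>National Institutes Of Health</span> <br>'
    ' NATIONAL INSTITUTES OF HEALTH NICHD<br>'
    'AVANTI POLAR LIPIDS:1109394 [17-010744]\n<hr>\n</li></ul>'
)

for fragments in text_outside_spans(sample):
    # The first fragment is the date line; the last two are the wanted lines
    print(' '.join(fragments[-2:]))
```

If the page is already parsed with BeautifulSoup, the same idea should carry over: the direct children of each li that are NavigableString instances (rather than tags) are the text outside the span tags, so filtering li.children on that type and stripping whitespace would give the same fragments.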
