
I have extracted information from a website on conversations between various speakers. I have extracted the information in the form of a string and it contains HTML tags. I would like to extract the name of each speaker and what he/she said. An example string is:

<p>7 <strong>Assoc Prof Walter Theseira</strong> asked&nbsp;the Deputy Prime Minister and Minister for Finance (a) what has been the framework for Government's procurement of goods and services to address the COVID-19 pandemic; (b) what proportion of such contracts, by dollar value, have been issued through the standard Government procurement system on GeBIZ; (c) what are the considerations when issuing contracts otherwise; and (d) what steps are taken to ensure that all enterprises can participate in such procurement opportunities so as to ensure provision at competitive prices and quality.&nbsp;</p><p class=\"ql-align-justify\"><strong>\tThe Second Minister for Finance (Ms Indranee Rajah) (for the Deputy Prime Minister and Minister for Finance)</strong>: Mr Deputy Speaker, t<span style=\"color: black;\">he Government procurement framework calls for open sourcing through the GeBIZ platform as the default approach. However, the use of limited tenders or direct contracting is permitted under specific conditions, such as to protect public health, or for reasons of national security. These practices are aligned with international standards laid out in the World Trade Organisation’s Agreement on Government Procurement.</span></p><p class=\"ql-align-justify\"><span style=\"color: black;\">To address the rapidly evolving COVID-19 situation and avoid further worsening of the public health situation, Government agencies had to obtain necessary goods and services as quickly as possible. While the default approach continues to be open sourcing via GeBIZ, the urgency meant that, in some cases, it was not practical to do so. For such instances, the procedures under Emergency Procurement allow Government agencies to directly contract with suppliers who have the necessary expertise and resources, instead of going through open sourcing. 
For example, as we needed to quickly source for and fit out premises to house at-risk persons, and also secure essential medical supplies, the agencies concerned established direct contracts with the suppliers outside GeBIZ who were best able to meet the requirements within the shortest timeframe possible. </span></p><p class=\"ql-align-justify\"><span style=\"color: black;\">Similar Emergency Procurement practices are also adopted in other jurisdictions. For example, in Australia, the European Union, New Zealand, the United Kingdom and the United States, government agencies may directly award a contract without the need for open competition under emergency situations, such as the COVID-19 crisis. </span></p>

I'd like to extract the textual content of what Assoc Prof Walter Theseira said. This would be:

asked the Deputy Prime Minister and Minister for Finance (a) what has been the framework for Government's procurement of goods and services to address the COVID-19 pandemic; (b) what proportion of such contracts, by dollar value, have been issued through the standard Government procurement system on GeBIZ; (c) what are the considerations when issuing contracts otherwise; and (d) what steps are taken to ensure that all enterprises can participate in such procurement opportunities so as to ensure provision at competitive prices and quality

The same would apply for what the other speaker said. I will then store the information accordingly (speaker and respective content) for further analysis.

I've tried regex, but it doesn't seem flexible or scalable enough, given the variety of HTML tags I would have to account for. What other options are there in Python?

mplungjan
Do not [parse HTML by regex](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454); use something that can understand HTML. – Patrick Artner Jul 08 '20 at 13:07

1 Answer


I'd use the BeautifulSoup library for Python for this kind of task.

Here's a quick example using your HTML.

from bs4 import BeautifulSoup

html = """
<p>7 <strong>Assoc Prof Walter Theseira</strong> asked&nbsp;the Deputy Prime Minister and Minister for Finance (a) what
    has been the framework for Government's procurement of goods and services to address the COVID-19 pandemic; (b) what
    proportion of such contracts, by dollar value, have been issued through the standard Government procurement system
    on GeBIZ; (c) what are the considerations when issuing contracts otherwise; and (d) what steps are taken to ensure
    that all enterprises can participate in such procurement opportunities so as to ensure provision at competitive
    prices and quality.&nbsp;</p>
<p class="ql-align-justify"><strong>\tThe Second Minister for Finance (Ms Indranee Rajah) (for the Deputy Prime
        Minister and Minister for Finance)</strong>: Mr Deputy Speaker, t<span style="color:black;">he Government
        procurement framework calls for open sourcing through the GeBIZ platform as the default approach. However, the
        use of limited tenders or direct contracting is permitted under specific conditions, such as to protect public
        health, or for reasons of national security. These practices are aligned with international standards laid out
        in the World Trade Organisation’s Agreement on Government Procurement.</span></p>
<p class="ql-align-justify"><span style="color: black;">To address the rapidly evolving COVID-19 situation and avoid
        further worsening of the public health situation, Government agencies had to obtain necessary goods and services
        as quickly as possible. While the default approach continues to be open sourcing via GeBIZ, the urgency meant
        that, in some cases, it was not practical to do so. For such instances, the procedures under Emergency
        Procurement allow Government agencies to directly contract with suppliers who have the necessary expertise and
        resources, instead of going through open sourcing. For example, as we needed to quickly source for and fit out
        premises to house at-risk persons, and also secure essential medical supplies, the agencies concerned
        established direct contracts with the suppliers outside GeBIZ who were best able to meet the requirements within
        the shortest timeframe possible. </span></p>
<p class="ql-align-justify"><span style="color: black;">Similar Emergency Procurement practices are also adopted in
        other jurisdictions. For example, in Australia, the European Union, New Zealand, the United Kingdom and the
        United States, government agencies may directly award a contract without the need for open competition under
        emergency situations, such as the COVID-19 crisis. </span></p>

        """


soup = BeautifulSoup(html, 'html.parser')


# Each speaker's name is wrapped in a <strong> tag, so print the text of each one.
for strong in soup.find_all("strong"):
    print(strong.text)


The output would look like this:

Assoc Prof Walter Theseira
        The Second Minister for Finance (Ms Indranee Rajah) (for the Deputy Prime
        Minister and Minister for Finance)
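The loop above prints only the speaker names. To pair each speaker with what they said, one possible approach is sketched below. It rests on assumptions drawn from the sample string rather than from any documented page format: every speaker turn starts with a `<p>` that contains a `<strong>` name, and any `<p>` without a `<strong>` continues the previous speaker. The helper names `normalize` and `extract_turns` and the shortened `sample` string are hypothetical and exist only for illustration.

```python
import re
from bs4 import BeautifulSoup


def normalize(text):
    """Collapse whitespace runs (including non-breaking spaces) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def extract_turns(html):
    """Return a list of (speaker, speech) tuples, one per speaker turn."""
    soup = BeautifulSoup(html, "html.parser")
    turns = []
    for p in soup.find_all("p"):
        strong = p.find("strong")
        if strong is not None:
            # Assumption: a <strong> tag marks the start of a new speaker's turn.
            speaker = normalize(strong.get_text())
            strong.extract()  # remove the name so only the speech remains
            speech = normalize(p.get_text())
            # Drop a leading question number and the ":" left behind after the name.
            speech = re.sub(r"^\d+\s*", "", speech)
            speech = speech.lstrip(": ")
            turns.append([speaker, speech])
        elif turns:
            # Assumption: a paragraph without a name continues the previous speaker.
            turns[-1][1] += " " + normalize(p.get_text())
    return [tuple(turn) for turn in turns]


# Shortened version of the string from the question, for illustration only.
sample = (
    "<p>7 <strong>Assoc Prof Walter Theseira</strong> asked&nbsp;the Deputy Prime "
    "Minister and Minister for Finance (a) what has been the framework.&nbsp;</p>"
    '<p class="ql-align-justify"><strong>\tThe Second Minister for Finance '
    "(Ms Indranee Rajah)</strong>: Mr Deputy Speaker, t"
    '<span style="color: black;">he Government procurement framework calls for '
    "open sourcing.</span></p>"
    '<p class="ql-align-justify"><span style="color: black;">Similar practices '
    "are adopted in other jurisdictions. </span></p>"
)

for speaker, speech in extract_turns(sample):
    print(f"{speaker} -> {speech}")
```

The resulting list of tuples can be passed to `dict()` or loaded into a DataFrame for further analysis. If the real pages use `<strong>` for emphasis inside a speech, the "new turn" test would need to be stricter, for example by requiring the `<strong>` to be the first element in the paragraph.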


Here's the documentation:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Matar