I use the following code to scrape the website:


import requests
from bs4 import BeautifulSoup
resp = requests.get('https://www.ecb.europa.eu/press/pressconf/2018/html/ecb.is180913.en.html')
soup = BeautifulSoup(resp.content, 'html5lib')
article = soup.find('article')
paragraphs = article.find_all('p')

The output looks like this:

[<p>Based on our regular economic and monetary analyses, we decided to keep the <strong>key ECB interest rates</strong> unchanged. .... to levels that are below, but close to, 2% over the medium term.</p>,
<p><strong>Has QE been used well by the various euro area countries?</strong></p>,
 <p>By and large, yes, it's been used well in the sense that the intended effects of the QE – mind, ... It reduced dispersion in growth rates everywhere. An employment situation which is by and large improving almost everywhere, some countries more than others. </p>,
 <p>If your question is meant to say; shouldn't governments have taken advantage of the situation of such low rates to decrease budget deficits, to restore? ... is a good situation for doing that.</p>,
 <p><strong>My second question is on reinvestment. ...Have you today explicitly asked the committees to come up with proposals on reinvestments?</strong></p>,
 <p>About inflation: I said inflation is going to hover around the present level for the rest of the year and then I gave numbers for next year and 2020. ...will reach our objective over the medium term. </p>,]

I would like to exclude each bold paragraph, i.e. one that starts with

 <p><strong>

and has more than 15 words. The desired output should be:

[<p>Based on our regular economic and monetary analyses, we decided to keep the <strong>key ECB interest rates</strong> unchanged. .... to levels that are below, but close to, 2% over the medium term.</p>,
 <p>By and large, yes, it's been used well in the sense that the intended effects of the QE – mind, ... It reduced dispersion in growth rates everywhere. An employment situation which is by and large improving almost everywhere, some countries more than others. </p>,
 <p>If your question is meant to say; shouldn't governments have taken advantage of the situation of such low rates to decrease budget deficits, to restore? ... is a good situation for doing that.</p>,
 <p>About inflation: I said inflation is going to hover around the present level for the rest of the year and then I gave numbers for next year and 2020. ...will reach our objective over the medium term. </p>,]

I tried to write the code myself but could not get the desired output. I would really appreciate your help.

petezurich
Vinh Vo
  • Possible duplicate of [Exclude unwanted tag on Beautifulsoup Python](https://stackoverflow.com/questions/40760441/exclude-unwanted-tag-on-beautifulsoup-python) – petezurich Nov 08 '18 at 10:54
  • My question is probably a bit different, or maybe it is not clear enough. A bold paragraph should be excluded only if it has more than 15 words. For example, I do not exclude a short bold paragraph such as "Thank you". – Vinh Vo Nov 08 '18 at 14:16

2 Answers

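Try the extract() function, which removes a tag from the tree:

article = soup.find('article')

# article.strong.extract() would remove only the first <strong> tag,
# so loop over the paragraphs and extract each one that opens with
# <strong> and has more than 15 words.
for p in article.find_all('p'):
    if str(p).startswith('<p><strong>') and len(p.get_text().split()) > 15:
        p.extract()

paragraphs_without_bold = article.find_all('p')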

See also this.

petezurich

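Use str() to convert each bs4 object to a string such as <p><strong>......</strong></p>, then keep only the paragraphs that do not contain <p><strong>:

....
paragraphs = article.find_all('p')

for p in paragraphs:
    # str(p) is the paragraph's HTML; skip it if it opens with a bold tag
    if '<p><strong>' not in str(p):
        print(str(p))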
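
The string check above does not yet apply the "more than 15 words" rule from the question. A minimal sketch of one way to add it is shown below. The sample markup, the helper names is_long_bold_paragraph and filter_paragraphs, and the max_words parameter are all hypothetical and used only for illustration. It also uses the built-in 'html.parser' instead of 'html5lib' so that it runs without an extra dependency or a network call.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the ECB page (assumption, not the real page).
SAMPLE_HTML = """
<article>
<p>We decided to keep the <strong>key ECB interest rates</strong> unchanged.</p>
<p><strong>Thank you</strong></p>
<p><strong>My second question is on reinvestment and whether you have today explicitly asked the committees to come up with proposals?</strong></p>
<p>About inflation: it is going to hover around the present level.</p>
</article>
"""


def is_long_bold_paragraph(p, max_words=15):
    """Return True if <p> holds only a single <strong> child and has more than max_words words."""
    # Drop whitespace-only text nodes before inspecting the children.
    children = [c for c in p.contents if not (isinstance(c, str) and not c.strip())]
    is_fully_bold = len(children) == 1 and getattr(children[0], "name", None) == "strong"
    return is_fully_bold and len(p.get_text().split()) > max_words


def filter_paragraphs(html, max_words=15):
    """Return the <p> tags inside <article>, minus the long, fully bold ones."""
    soup = BeautifulSoup(html, "html.parser")
    article = soup.find("article")
    return [p for p in article.find_all("p") if not is_long_bold_paragraph(p, max_words)]


kept = filter_paragraphs(SAMPLE_HTML)
for p in kept:
    print(p)
```

Unlike extract(), this leaves the parsed tree untouched and only builds a filtered list, so the original paragraphs remain available if they are needed later. A paragraph with a bold phrase in the middle of normal text is kept, because it has more than one child node.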
ewwink