Python Beautiful Soup print specific lines within multiline containing string

Question

How can I get or print only those lines of a long multiline text within a single <p> tag that contain a certain string? On the website, the line breaks are made with <br> tags. There is no closing </p> tag.

Basic structure of the website:

<p style="line-height: 150%">
I need a big cup of coffee and cookies.
<br>
I do not like tea with milk.
<br>
I can't live without coffee and cookies.
<br>
...

Let's assume I want to print only the lines containing the words "coffee and cookies". In this case, only the first and third "lines" (sentences) of this <p> should be printed.

I have Beautiful Soup 4.6.3 installed under Python 3.7.1.

findAll seems to be tag-oriented and returns the whole <p>, right? So how can I do this? Maybe with a regex or some other pattern?

ewwink · Answer 1 · 2018-11-20T13:00:33.387

score 0

Convert each child of the bs4 element to a string using str(); then you can check whether it contains "coffee and cookies".

from bs4 import BeautifulSoup

html_doc = """<p style="line-height: 150%">
    I need a big cup of coffee and cookies. <a href="aaa">aa</a>
    <br>
    I do not like tea with milk.
    <br>
    I can't live without coffee and cookies.
    <br>"""

soup = BeautifulSoup(html_doc, 'html.parser')
paragraph = soup.find('p')

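# Iterating over a tag yields its direct children: text nodes and nested tags.
for node in paragraph:
    if 'coffee and cookies' in str(node):
        # Look for an <a> among the later siblings. Note that find_next_sibling('a')
        # returns the first <a> that follows, even if it sits on a later line.
        next_is_a = node.find_next_sibling('a')
        if next_is_a:
            print(node.strip() + ' ' + str(next_is_a))
        else:
            print(node.strip())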

edited Nov 20 '18 at 13:00

answered Nov 13 '18 at 11:33

ewwink


I tried to use your code, the output is not the whole line but only: ) coffee and cookies – user3087516 Nov 13 '18 at 12:58
It returns `I need a big cup of coffee and cookies.` and `I can't live without coffee and cookies.` Try it: https://repl.it/repls/YouthfulRashInsurance – ewwink Nov 13 '18 at 13:28
Thanks, it works for this example. What if we also have an <a> tag in the same line and it should be printed as well? – user3087516 Nov 20 '18 at 11:53
Use `.find_next_sibling()`; see above. – ewwink Nov 20 '18 at 13:01
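
For the follow-up about an <a> tag on the same line, one possible approach is to group the children of the <p> into "lines" delimited by <br> and then test each line as a whole. The following is only a sketch: `lines_of` is a hypothetical helper (not part of Beautiful Soup), and it assumes every <br> is a direct child of the <p>, as in the sample HTML.

```python
from bs4 import BeautifulSoup

html_doc = """<p style="line-height: 150%">
    I need a big cup of coffee and cookies. <a href="aaa">aa</a>
    <br>
    I do not like tea with milk.
    <br>
    I can't live without coffee and cookies.
    <br>"""


def lines_of(paragraph):
    """Hypothetical helper: group a tag's direct children into lines split at <br>."""
    lines, current = [], []
    for child in paragraph.children:
        if getattr(child, "name", None) == "br":
            lines.append(current)
            current = []
        else:
            current.append(child)
    if current:
        lines.append(current)  # trailing content without a final <br>
    return lines


soup = BeautifulSoup(html_doc, "html.parser")
matches = []
for nodes in lines_of(soup.find("p")):
    # Rebuild the markup of one line (tags such as <a href=...> are kept)
    # and collapse runs of whitespace into single spaces.
    markup = " ".join("".join(str(node) for node in nodes).split())
    if "coffee and cookies" in markup:
        matches.append(markup)

for line in matches:
    print(line)
```

Because the phrase is searched in the rebuilt markup, it would also match inside an attribute value; if that matters, the check could be done on the text of each node instead.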

score 0 · Answer 2 · answered Nov 14 '18 at 13:10

If I understood your requirement correctly, the following snippet should get you there:

from bs4 import BeautifulSoup

htmlelem = """
    <p style="line-height: 150%">
    I need a big cup of coffee and cookies.
    <br>
    I do not like tea with milk.
    <br>
    I can't live without coffee and cookies.
    <br>
"""

soup = BeautifulSoup(htmlelem, 'html.parser')
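for paragraph in soup.find_all('p'):
    # Skip paragraphs that do not mention the phrase at all.
    if "coffee and cookies" not in paragraph.text:
        continue
    # Note: this prints the text of the whole paragraph, not only the matching lines.
    print(paragraph.get_text(strip=True))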
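
Since the loop above works at paragraph level, a per-line variant may be closer to what the question asks for. A minimal sketch, assuming each line is a single text node between two <br> tags (true for this sample, but not once a line contains inline tags such as <a>); the names `needle` and `found` are illustrative only:

```python
from bs4 import BeautifulSoup

htmlelem = """
    <p style="line-height: 150%">
    I need a big cup of coffee and cookies.
    <br>
    I do not like tea with milk.
    <br>
    I can't live without coffee and cookies.
    <br>
"""

soup = BeautifulSoup(htmlelem, "html.parser")
needle = "coffee and cookies"
found = []
for paragraph in soup.find_all("p"):
    # stripped_strings yields every non-blank text node with the
    # surrounding whitespace removed, one string per text node.
    for text in paragraph.stripped_strings:
        if needle in text:
            found.append(text)

print("\n".join(found))
```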

Thanks, it works for this example. What if we also have an <a> tag in the same line and it should be printed as well? – user3087516 Nov 20 '18 at 11:55

score 0 · Answer 3 · answered Nov 14 '18 at 14:45


Can you split on \n?

from bs4 import BeautifulSoup

html = """
    <p style="line-height: 150%">
    I need a big cup of coffee and cookies.
    <br>
    I do not like tea with milk.
    <br>
    I can't live without coffee and cookies.
    <br>
"""

soup = BeautifulSoup(html, 'html.parser')
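for item in soup.select('p'):
    # The source HTML puts each sentence on its own line, so split on newlines.
    lines = item.text.split('\n')
    for line in lines:
        if "coffee and cookies" in line:
            print(line.strip())  # drop the indentation carried over from the HTML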
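
Splitting on \n relies on the page source itself having a newline at every <br>. If that cannot be assumed, one option is to turn each <br> into a newline first. This is a sketch under the assumption that the text contains no other newlines; the one-line HTML string is made up for illustration:

```python
from bs4 import BeautifulSoup

# Same sentences as above, but without any newlines in the source.
html = ('<p style="line-height: 150%">I need a big cup of coffee and cookies.<br>'
        'I do not like tea with milk.<br>'
        "I can't live without coffee and cookies.<br></p>")

soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("p")

# Replace every <br> tag with a newline character in the parse tree.
for br in paragraph.find_all("br"):
    br.replace_with("\n")

selected = [line.strip()
            for line in paragraph.get_text().split("\n")
            if "coffee and cookies" in line]

print("\n".join(selected))
```

Text inside inline tags stays on its line with this approach, because get_text() concatenates it with the surrounding text.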

QHarr


Thanks, it also works for this example. What if we also have an <a> tag in the same line and it should be printed as well? – user3087516 Nov 20 '18 at 11:55
Like soup.select('p,a')? That would gather p and a tag elements. – QHarr Nov 20 '18 at 12:15
This prints only the text from the a tag, right? What if I also need the href value? – user3087516 Nov 20 '18 at 12:20
It will grab those elements; you would then retrieve the href value from the a tag, though it would be better to use soup.select('p, a[href]'). See https://stackoverflow.com/questions/1080411/retrieve-links-from-web-page-using-python-and-beautifulsoup – QHarr Nov 20 '18 at 12:22
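
To illustrate the href part of this exchange, here is a small sketch of reading both the link text and the href value. The sample <a href="aaa">aa</a> is borrowed from Answer 1; the selector 'p a[href]' (links inside a paragraph) and the name `links` are assumptions made for this example:

```python
from bs4 import BeautifulSoup

html = """
    <p style="line-height: 150%">
    I need a big cup of coffee and cookies. <a href="aaa">aa</a>
    <br>
    I do not like tea with milk.
    <br>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
# a[href] matches only <a> tags that actually carry an href attribute,
# so the dictionary-style lookup below cannot raise KeyError.
for a in soup.select("p a[href]"):
    links.append((a.get_text(strip=True), a["href"]))

for text, href in links:
    print(text, href)
```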
