
From the following html,

html = '''
<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>word key is not present</td>
<td>no keyword here</td>'''

I want to find the strings that have the word "keyword" in them.

So in this example, I want to find

<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>no keyword here</td>

So I tried:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
ans = soup.find_all('td', text=lambda l: l and 'keyword' in l)
print(ans)
# [<td>no keyword here</td>]

But this doesn't return the other line that has "keyword" in it. How do I go about it?

3 Answers

3

You can try using :contains with CSS selectors:

from bs4 import BeautifulSoup

html = '''
<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>word key is not present</td>
<td>no keyword here</td>'''
# Name the parser explicitly to avoid the "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('td:contains("keyword")'))

>>> [<td>the keyword is present in the <a href="text" title="text">text</a> </td>, 
    <td>no keyword here</td>]

EDIT

In newer versions of BS4, :contains has been deprecated. You can use :-soup-contains() or :-soup-contains-own() instead.

from bs4 import BeautifulSoup as bs

html = """<table><tr>
<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>word key is not present</td>
<td>no keyword here</td></tr>
</table>"""
soup = bs(html, 'html.parser')
variable = "keyword"
# Quote the value inside the selector so keywords with spaces also work
print(soup.select(f'td:-soup-contains("{variable}")'))
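The two pseudo-classes mentioned above differ in which text they look at. The following is a minimal sketch, assuming a Soup Sieve version recent enough to support the -soup- prefixed names; the second table cell is a made-up example added only to show the difference.

```python
from bs4 import BeautifulSoup

# The second <td> is a hypothetical cell added for illustration:
# its keyword appears only inside the nested <a> tag.
html = """<table><tr>
<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>see this <a href='text'>keyword link</a></td>
<td>no keyword here</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# Matches when "keyword" appears anywhere in the cell, child tags included.
anywhere = soup.select('td:-soup-contains("keyword")')

# Matches only when "keyword" is in the cell's own direct text nodes.
own_only = soup.select('td:-soup-contains-own("keyword")')

print(len(anywhere))
print(len(own_only))
```

If only the visible text directly inside the cell matters, the -own variant is probably the one to reach for; otherwise the plain variant behaves like the deprecated :contains.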

The variable above can also be passed in from the command line:

import argparse

from bs4 import BeautifulSoup as bs

# Read the keyword from the command line, e.g. --keyword keyword
parser = argparse.ArgumentParser()
parser.add_argument('--keyword', required=True, help='Add some keyword to search')
args = parser.parse_args()
keyword = args.keyword

html = """<table><tr>
<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>word key is not present</td>
<td>no keyword here</td></tr>
</table>"""
soup = bs(html, 'html5lib')
# Quote the value inside the selector so keywords with spaces also work
print(soup.select(f'td:-soup-contains("{keyword}")'))
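When the keyword arrives as a function argument rather than from the command line, the same selector can be built inside the function. This is only a sketch: find_elements_containing is a hypothetical helper name, and the manual escaping assumes the keyword should be treated as a double-quoted CSS string.

```python
from bs4 import BeautifulSoup


def find_elements_containing(html, keyword, tag="td"):
    """Return every <tag> whose text (children included) contains keyword.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Escape backslashes and double quotes so the keyword stays a valid
    # double-quoted CSS string inside the selector.
    escaped = keyword.replace("\\", "\\\\").replace('"', '\\"')
    return soup.select(f'{tag}:-soup-contains("{escaped}")')


html = """<p>the keyword string is present in the text</p>
<p>the_word is here</p>"""

matches = find_elements_containing(html, "keyword string", tag="p")
print(matches)
```

Interpolating the value into the selector, instead of writing the variable name inside the string literal, is what makes the search use the keyword itself, spaces included.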
Rinshan Kolayil
  • Say soup.select is in a function and "keyword" has to come from an argument; how do we do the same job? E.g. html = '''<p>the keyword string is present in the text</p> <p>the_word is here</p>''' the_word = 'keyword string' soup = BeautifulSoup(html) print(soup.select('p:contains(the_word)')) # it prints [<p>the_word is here</p>] when we wanted [<p>the keyword string is present in the text</p>] –  Sep 11 '21 at 16:38
  • I'm getting NotImplementedError. Does it mean BeautifulSoup on my machine is not up to date? –  Sep 11 '21 at 17:28
  • Please have a look at https://stackoverflow.com/questions/34553622/dealing-with-a-colon-in-beautifulsoup-css-selectors – Rinshan Kolayil Sep 11 '21 at 18:15
0

Do you mean something like this?

html = '''
<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>word key is not present</td>
<td>no keyword here</td>'''
print('\n'.join(line for line in html.splitlines() if 'keyword' in line))

Output:

<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>no keyword here</td>
U13-Forward
0

Try this:

from bs4 import BeautifulSoup

html = '''
<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>word key is not present</td>
<td>no keyword here</td>'''

soup = BeautifulSoup(html , 'html.parser')
print(*[td for td in soup.find_all("td") if 'keyword' in td.text], sep='\n')

Output:

<td>the keyword is present in the <a href="text" title="text">text</a> </td>
<td>no keyword here</td>

You can use td.text to get just the text inside each <td>, like below:

print(*[td.text for td in soup.find_all("td") if 'keyword' in td.text], sep='\n')

Output:

the keyword is present in the text 
no keyword here
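As a possible explanation of why filtering on td.text works where the find_all text filter from the question did not: that filter is checked against each tag's .string, which is None whenever the tag has more than one child node. A short sketch of the difference, using a trimmed copy of the question's HTML:

```python
from bs4 import BeautifulSoup

html = """
<td>the keyword is present in the <a href='text' title='text'>text</a> </td>
<td>no keyword here</td>"""

soup = BeautifulSoup(html, "html.parser")
first, second = soup.find_all("td")

# The first cell holds text plus an <a> tag, so .string is None and a
# filter applied to .string can never match it.
print(first.string)

# The second cell holds a single text node, so .string is that text.
print(second.string)

# .text / get_text() joins all descendant text, so the keyword is found.
print("keyword" in first.get_text())
```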
I'mahdi