Using BeautifulSoup to search HTML for string

Question

I am using BeautifulSoup to look for user-entered strings on a specific page. For example, I want to see if the string 'Python' is located on the page: http://python.org

When I used: find_string = soup.body.findAll(text='Python'), find_string returned []

But when I used: find_string = soup.body.findAll(text=re.compile('Python'), limit=1), find_string returned [u'Python Jobs'] as expected

What is the difference between these two statements that makes the second one work when there is more than one instance of the word being searched for?

score 86 · Accepted Answer · answered Jan 20 '12 at 02:57

The following line is looking for the exact NavigableString 'Python':

>>> soup.body.findAll(text='Python')
[]

Note that the following NavigableString is found:

>>> soup.body.findAll(text='Python Jobs') 
[u'Python Jobs']

Note that a regexp anchored at both ends behaves like the exact match and also finds nothing:

>>> import re
>>> soup.body.findAll(text=re.compile('^Python$'))
[]

So your regexp matches any NavigableString that contains 'Python', whereas text='Python' matches only a NavigableString that is exactly 'Python'.
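
The three cases above can be reproduced offline with a small sketch. It assumes the newer bs4 package (Python 3), where find_all and the string= argument are the current spellings of findAll and text=. The markup is a made-up stand-in for the python.org page, not its real content.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
html = "<body><a>Python Jobs</a><p>Python</p><p>Learn Python today</p></body>"
soup = BeautifulSoup(html, "html.parser")

# A plain string must equal the whole text node.
exact = soup.body.find_all(string="Python")

# A compiled regexp is applied with search(), so a substring is enough.
contains = soup.body.find_all(string=re.compile("Python"))

# Anchoring the regexp at both ends turns it back into an exact match.
anchored = soup.body.find_all(string=re.compile("^Python$"))

print(exact)
print(contains)
print(anchored)
```

Here the exact and anchored searches should each return only the node that is precisely 'Python', while the unanchored regexp returns all three text nodes.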

answered Jan 20 '12 at 02:57

sgallen

Is it possible to get the parent tag of a specific text? – Samay Feb 13 '18 at 11:49

@Samay `soup.find(text='Python Jobs').parent` — from docs: ["Going up"](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-up) – Denis Nov 15 '19 at 16:29
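
A minimal sketch of the .parent lookup from the comment above, assuming bs4 on Python 3 and using string= as the current name for text=. The markup and the /jobs/ link are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the href value is an assumption for the example.
html = '<ul><li><a href="/jobs/">Python Jobs</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find() returns the matching text node, or None when nothing matches.
node = soup.find(string="Python Jobs")

# .parent is the tag that directly contains the text node.
parent = node.parent if node is not None else None

print(parent.name, parent["href"])
```

Checking for None before reading .parent avoids an AttributeError when the text is not on the page.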

jfs · Answer 2 · 2016-03-07T00:43:49.663

38

text='Python' searches for elements that have the exact text you provided:

import re
from BeautifulSoup import BeautifulSoup

html = """<p>exact text</p>
   <p>almost exact text</p>"""
soup = BeautifulSoup(html)
print soup(text='exact text')
print soup(text=re.compile('exact text'))

Output

[u'exact text']
[u'exact text', u'almost exact text']

"To see if the string 'Python' is located on the page http://python.org":

import urllib2
html = urllib2.urlopen('http://python.org').read()
print 'Python' in html # -> True

If you need the position of the substring within the string, you can use html.find('Python'), which returns the index of the first occurrence, or -1 if there is none.
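
On Python 3 there is no urllib2 module; urllib.request.urlopen is its replacement, and read() returns bytes that must be decoded first. The following is a rough sketch under those assumptions. The helper names fetch_html and locate are hypothetical, and the sample string is made up so that the example runs without network access.

```python
from urllib.request import urlopen


def fetch_html(url, timeout=10):
    """Download a page and decode it to text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def locate(page_text, needle):
    """Return (found, index); index is -1 when needle is absent."""
    index = page_text.find(needle)
    return index != -1, index


# Inline sample so the sketch runs offline; with network access you
# could pass fetch_html('http://python.org') instead.
sample = "<html><body><h1>Welcome to Python.org</h1></body></html>"
found, position = locate(sample, "Python")
print(found, position)
```

Note that this searches the raw markup, so a match inside a tag name or attribute counts as well as a match in visible text.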

edited Mar 07 '16 at 00:43

answered Jan 20 '12 at 02:56

jfs

Is it possible to find all occurrences of the string Python, not just one? – Timo Jan 27 '21 at 17:37

@Timo https://stackoverflow.com/questions/4664850/how-to-find-all-occurrences-of-a-substring – jfs Jan 27 '21 at 18:42

[m.start() for m in re.finditer('test', soup)] ? I am lost.. – Timo Jan 27 '21 at 19:19

@Timo copy the code from [the accepted answer to the StackOverflow question I've linked](https://stackoverflow.com/a/4664889/4279). Make sure the code fragment works in your environment. Start changing it to your task (one simple change at a time). Once it breaks (when it does something unexpected for you), use it as [the minimal reproducible code example to ask a new StackOverflow question](https://stackoverflow.com/help/minimal-reproducible-example) – jfs Jan 28 '21 at 16:28
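
For reference, a minimal sketch of the re.finditer approach discussed in these comments. One likely stumbling block is that re.finditer expects a string, so a BeautifulSoup object would first have to be converted, for example with str(soup) or soup.get_text(). The helper name find_all_positions and the sample text are made up for illustration.

```python
import re


def find_all_positions(text, needle):
    """Return the start index of every non-overlapping occurrence."""
    # re.escape makes the needle match literally, even if it contains
    # characters that are special in regular expressions.
    return [m.start() for m in re.finditer(re.escape(needle), text)]


# Hypothetical text; in practice this could be str(soup) or soup.get_text().
sample = "Python Jobs, Python Docs, and more Python"
positions = find_all_positions(sample, "Python")
print(positions)
```

An empty list means the substring does not occur at all.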

MendelG · Answer 3 · 2022-07-05T19:16:06.967

12

In addition to the accepted answer, you can use a lambda instead of a regex:

from bs4 import BeautifulSoup

html = """<p>test python</p>"""

soup = BeautifulSoup(html, "html.parser")

print(soup(text="python"))
print(soup(text=lambda t: "python" in t.text))

Output:

[]
['test python']
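
The same idea extends to case-insensitive matching. This is only a sketch, assuming bs4 4.4 or later, where string= is the preferred name for the text= argument. The function name mentions_python and the markup are hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup with mixed capitalisation.
html = "<p>test python</p><p>Python rocks</p><p>no match here</p>"
soup = BeautifulSoup(html, "html.parser")


def mentions_python(s):
    """Return True when the text node contains 'python' in any case."""
    # Each text node is passed in as a str subclass, so str methods work.
    return s is not None and "python" in s.lower()


matches = soup.find_all(string=mentions_python)
print(matches)
```

A named function is used here instead of a lambda only to leave room for the docstring; the behaviour is the same.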

edited Jul 05 '22 at 19:16

answered Aug 25 '20 at 04:07

MendelG

Bit Bucket · Answer 4 · 2012-01-20T02:53:13.307

3

I have not used BeautifulSoup, but maybe the following can help in some small way.

import re
import urllib2
stuff = urllib2.urlopen(your_url_goes_here).read()  # stuff will contain the *entire* page

# Replace the string Python with your desired regex
results = re.findall('(Python)',stuff)

for i in results:
    print i

I'm not suggesting this is a replacement but maybe you can glean some value in the concept until a direct answer comes along.

edited Jan 20 '12 at 02:53

answered Jan 20 '12 at 02:47

Bit Bucket

Googlers see https://stackoverflow.com/questions/34475051/need-to-install-urllib2-for-python-3-5-1 for a modern update. – msanford Jun 30 '20 at 15:18
