You don't need to search every tag. Select only the span tags that contain the text with a CSS selector, then filter them using in. The text you want is inside span class="Text Intro Justify":
base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"
from bs4 import BeautifulSoup
import requests
# name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(requests.get(base_url).content, "html.parser")
text = [t.text for t in soup.select('div span.Text.Intro.Justify') if "agricultural" in t.text]
Which will give you:
['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']
For a case-insensitive match, lowercase the text before testing it: if "agricultural" in t.text.lower()
Also, if you want exact word matches, you need to split the text or use a regex with word boundaries. Otherwise a plain substring test can return false positives for words that merely contain your search term.
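As a rough sketch of the case-insensitive filter, here is the same selector run against a small inline document instead of the live page. The markup, the `Number` class, and the helper name `spans_mentioning` are made up for illustration; only the selector and the `.lower()` test come from the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the statute page.
html = """
<div>
  <span class="Text Intro Justify">Agricultural products include gum resin.</span>
  <span class="Text Intro Justify">This section defines the word "person".</span>
  <span class="Number">agricultural</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def spans_mentioning(soup, word):
    """Return the text of the targeted spans that contain `word`, ignoring case."""
    needle = word.lower()
    return [
        span.text
        for span in soup.select("div span.Text.Intro.Justify")
        if needle in span.text.lower()
    ]


matches = spans_mentioning(soup, "agricultural")
print(matches)
```

The span with class `Number` is skipped even though it contains the word, because the selector only visits spans carrying all three classes.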
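import re
soup = BeautifulSoup(requests.get(base_url).content, "html.parser")
# match the exact word only, ignoring case
r = re.compile(r"\bagricultural\b", re.I)
# the class value must be written exactly as it appears in the HTML attribute
text = [t.text for t in soup.find_all('span', {"class": 'Text Intro Justify'}, text=r)]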
Using re.I will match both agricultural and Agricultural.
Using word boundaries means you would not match "foo" if the string contained "foobar".
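To make the difference concrete, here is a small standard-library sketch that compares the word-boundary regex with a plain substring test. The sample strings are invented for the comparison and are not taken from the statute page.

```python
import re

# Same pattern as above: the whole word only, any capitalisation.
pattern = re.compile(r"\bagricultural\b", re.I)

# Hypothetical sample strings, chosen to show the false positives.
samples = [
    "agricultural products",
    "Agricultural commodities",
    "nonagricultural land",
    "agriculturally zoned",
]

# Word-boundary regex: only the standalone word counts as a hit.
hits = [s for s in samples if pattern.search(s)]

# Plain substring test: also matches words that merely contain the term.
substring_hits = [s for s in samples if "agricultural" in s.lower()]

print(hits)
print(substring_hits)
```

The regex keeps only the first two strings, while the substring test accepts all four.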
Regardless of the approach you take, once you know the specific tags you want, search only those. Searching every tag may return matches that are completely unrelated to what you actually want.
If you have a lot of parsing to do like the above, where you are filtering by text, you may find lxml very powerful. With an xpath expression the filtering is straightforward:
base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"
from lxml.etree import fromstring, HTMLParser
import requests
r = requests.get(base_url).content
xml = fromstring(r, HTMLParser())
print(xml.xpath("//span[@class='Text Intro Justify' and contains(text(),'agricultural')]//text()"))
Which gives you:
['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']
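XPath 1.0 has no lower-case function, so to ignore case you use translate to map upper-case letters to lower case. Translating A to a is enough to catch Agricultural as well as agricultural:
print(xml.xpath("//span[@class='Text Intro Justify' and contains(translate(text(), 'A','a'), 'agricultural')]//text()"))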
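Translating only A will not catch a fully upper-case AGRICULTURAL. One way to cover any capitalisation, sketched here against made-up inline markup rather than the live page, is to translate the whole ASCII alphabet. The names `UPPER`, `LOWER` and `query` are just for this example.

```python
from lxml.etree import fromstring, HTMLParser

# Hypothetical markup standing in for the statute page.
html = """
<html><body><div>
  <span class="Text Intro Justify">AGRICULTURAL uses</span>
  <span class="Text Intro Justify">Agricultural purposes</span>
  <span class="Text Intro Justify">aquaculture only</span>
</div></body></html>
"""

tree = fromstring(html, HTMLParser())

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# translate() maps each character of UPPER to the character at the same
# position in LOWER, which lowercases ASCII text before contains() runs.
query = (
    "//span[@class='Text Intro Justify' and "
    "contains(translate(text(), '{}', '{}'), 'agricultural')]//text()"
).format(UPPER, LOWER)

results = tree.xpath(query)
print(results)
```

This only handles ASCII letters; accented characters would need to be added to both strings.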
The \u201c and \u201d you see are the repr output for “ and ”. When you actually print the strings, you will see the str output.
In [3]: s = u"Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture."
In [4]: print(s)
Whenever the terms “agriculture,” “agricultural purposes,” “agricultural uses,” or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.
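The escaped lists above come from Python 2, where the repr of a unicode string escapes non-ASCII characters. On Python 3 the repr already shows the curly quotes, so if you want to see the escaped form there, the built-in ascii() is one way to get it. A short sketch with a shortened version of the string:

```python
# Shortened version of the string above, with the same curly quotes.
s = "Whenever the terms \u201cagriculture,\u201d are used"

# ascii() escapes every non-ASCII character, like the Python 2 repr did.
escaped = ascii(s)
print(escaped)

# print() writes the characters themselves, so the curly quotes appear.
print(s)
```

Either way the string holds the same characters; only the way it is displayed differs.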