
So I was able to scrape a whole chapter of statutes with the code below. However, let's say I only want to scrape the paragraphs that contain the word "agricultural". How do I do that?

from bs4 import BeautifulSoup
import requests
import sys

# raw string so the backslashes in the Windows path are not treated as escapes
f = open(r'C:\Python27\projects\Florida\FL_finalexact.doc', 'w')

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter{chapter:03d}/All"

for chapter in range(1, 40):
    url = base_url.format(chapter=chapter)
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException as e:
        print "missing url"
        print e
        sys.exit(1)
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})

    if tableContents is not None:
        for title in tableContents.find_all('div', {'class': 'Title'}):
            f.write('\n\n' + title.text + '\n\n')

        for data in tableContents.find_all('div', {'class': 'Section'}):
            data = data.text.encode("utf-8", "ignore")
            data = "\n" + str(data) + "\n"
            f.write(data)

Do I need to use a regular expression for this task?

CHballer

2 Answers


You don't need a regular expression. BeautifulSoup is more powerful than that:

soup = BeautifulSoup(r.content, "html.parser")
soup.find_all(lambda tag: "agricultural" in tag.string if tag.string else False)

is sufficient to give you a list of all elements that contain the word "agricultural". You can then iterate through the list and pull out the relevant strings:

results = soup.find_all(...) # function as before
scraped_paragraphs = map(lambda element: element.string, results)

and then write the elements in scraped_paragraphs wherever you will.
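As an aside on writing those strings out: rather than calling .encode() on each string yourself, you can open the output file with an explicit encoding via io.open, which works in both Python 2 and 3. A minimal sketch (the file name is made up for illustration):

```python
import io

# Hypothetical scraped strings, including a curly-quote character.
paragraphs = [u"agricultural products", u"\u201cagriculture\u201d and more"]

# io.open handles the encoding for you, so unicode strings can be
# written directly without manual .encode() calls.
with io.open("scraped.txt", "w", encoding="utf-8") as out:
    for p in paragraphs:
        out.write(p + u"\n")
```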

How This Works

BeautifulSoup supports a find_all() feature that returns all tags matching a particular criterion fed into find_all(). This criterion can take the form of a regular expression, a function, a list, or even just True. In this case, a suitable boolean function is enough.
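The different criterion forms can be illustrated with a small self-contained sketch (the toy HTML below is made up for illustration):

```python
from bs4 import BeautifulSoup
import re

html = "<div><p>agricultural law</p><span>farm law</span><br/></div>"
soup = BeautifulSoup(html, "html.parser")

# 1. An exact tag name (string)
names = [t.name for t in soup.find_all("p")]

# 2. A regular expression matched against tag names
regex_names = [t.name for t in soup.find_all(re.compile("^s"))]

# 3. A function that receives each tag and returns True/False
func_names = [t.name for t in soup.find_all(lambda tag: tag.string == "farm law")]

# 4. True matches every tag in the document
all_tags = [t.name for t in soup.find_all(True)]
```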

More importantly, however, every HTML tag in soup is indexed by a variety of attributes. You can query an HTML tag for attributes, children, siblings, and, of course, a contained inner text marked by string.

What this solution does is simply filter the parsed HTML for all elements whose string contains "agricultural". Because not every element has a string attribute, we must check for it first - hence the conditional expression, which performs the membership test when tag.string is present and returns False otherwise.

An Example

Here's what it looks like for Chapter001:

soup.find_all(lambda tag: "agricultural" in tag.string if tag.string else False)
>>>> [<span class="Text Intro Justify" xml:space="preserve">Crude turpentine gum (oleoresin), the product of a living tree or trees of the
     pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural 
     products, farm products, and agricultural commodities.</span>, 
     <span class="Text Intro Justify" xml:space="preserve">Whenever the terms “agriculture,” “agricultural purposes,” “agricultural uses,” or 
     words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; 
     aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; 
     and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.
     </span>]

Calling the map function on results yields the inner strings without accompanying span elements and nasty attributes:

map(lambda element: element.string, soup.find_all(...))
>>>> [u'Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', 
      u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']
Akshat Mahajan
  • thank you for the additional explanation. the code works perfect. – CHballer Apr 03 '16 at 08:12
  • What exactly does the lambda function do? – CHballer Apr 03 '16 at 08:13
  • A lambda function is an anonymous function - it lets you define functions in a line without requiring to name it anything. In terms of what it does here, `soup.find_all()` walks through all of its tags and passes `tag` to the lambda function. If the function returns `True`, soup keeps it - if it returns False, soup moves on. – Akshat Mahajan Apr 03 '16 at 08:14
  • @TianMa: Here's a [nice tutorial](http://www.diveintopython.net/power_of_introspection/lambda_functions.html) on lambda functions. – Akshat Mahajan Apr 03 '16 at 08:15
  • re.compile is often use with find and find_all. using a lambda is probably going to be as slow, if the OP wanted exact matches using in will also return false positives – Padraic Cunningham Apr 03 '16 at 23:45
  • @AkshatMahajan, So i was able to write it into a word doc, but the ouput still in repr format, all the u' and \u201 are still within the paragraph. I tried to format it as str(scraped_paragraphs), but it didnt do the trick. any suggestions? – CHballer Apr 04 '16 at 01:02
  • @TianMa: You want to encode your unicode string in a suitable format, like UTF-8. See [this](http://stackoverflow.com/questions/5483423/how-to-write-unicode-strings-into-a-file) for an example. Basically, if `string` is your string, you want to write to the file `string.encode("UTF-8")`. – Akshat Mahajan Apr 04 '16 at 01:07
  • line 1: `scraped_paragraphs = map(lambda element: element.string, results)` line 2: `scraped_paragraphs = "\n" + str(scraped_paragraphs)+ "\n"` line 3: `f.write(scraped_paragraphs.encode("utf-8"))` this is the code i used, I still get repr in my result, or if add `scraped_paragraphs = scraped_paragraphs.text.encode("utf-8","ignore")` I will get error of list object has no attribute 'encode ' – CHballer Apr 04 '16 at 01:29
  • @TianMa: Found you a [solution](http://stackoverflow.com/a/35536228/2271269). Use the Python library `unidecode`. I tested it on the example here, and it works very, very well. – Akshat Mahajan Apr 04 '16 at 01:35
  • @AkshatMahajan, So I unidecode scraped_paragraphs first then encoded it to unf-8? – CHballer Apr 04 '16 at 02:32
  • @TianMa No, just unidecode it. The link I gave you has an example. – Akshat Mahajan Apr 04 '16 at 02:55
  • @AkshatMahajan the link shows how to unidecode a specific character, what if i want to unidecode the whole scraped_paragraphs? I tried unidecode(scraped_paragraphs), but got error instead. – CHballer Apr 04 '16 at 06:16
  • @TianMa This comment section is getting too long, and is no longer about the main question. Why not ask this specific encoding issue as a separate question instead? You'll get more responses, better responses, and other people will find it useful down the line too. – Akshat Mahajan Apr 04 '16 at 06:18
  • @AkshatMahajan, you are right, I finally got it to work, thanks for helps! – CHballer Apr 04 '16 at 06:44

You don't want to search every tag; you can select just the span tags that contain the text and filter using in, using a CSS selector to select the tags. What you want is the text inside span class="Text Intro Justify":

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"

from bs4 import BeautifulSoup
import requests

soup = BeautifulSoup(requests.get(base_url).content, "html.parser")

text = [t.text for t in soup.select('div span.Text.Intro.Justify') if "agricultural" in t.text]

Which will give you:

['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']

If you want to match case-insensitively, you would need to use if "agricultural" in t.text.lower().

Also, if you want exact matches, you would need to split the text or use a regex with word boundaries, or you could end up getting false positives for certain words.

import re

soup = BeautifulSoup(requests.get(base_url).content, "html.parser")

# look for the exact word, ignoring case
r = re.compile(r"\bagricultural\b", re.I)
text = [t.text for t in soup.find_all('span', {"class": 'Text Intro Justify'}, text=r)]

Using re.I will match both agricultural and Agricultural.

Using word boundaries means \bfoo\b would not match inside "foobar".
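The difference between a plain substring test and a word-boundary regex can be seen in a short sketch (the sample strings are hypothetical):

```python
import re

sentences = ["Agricultural lands", "agriculturally rich soil"]

# A plain substring test reports both sentences, even though only the
# first uses "agricultural" as a standalone word.
naive = [s for s in sentences if "agricultural" in s.lower()]

# \b anchors the match at word boundaries, so "agriculturally" is
# rejected, while re.I keeps the match case-insensitive.
pattern = re.compile(r"\bagricultural\b", re.I)
exact = [s for s in sentences if pattern.search(s)]
```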

Regardless of the approach you take, once you know the specific tags you want, you should search only those; searching every tag may return matches that are completely unrelated to what you actually want.
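A toy example of that difference (the markup below is made up; only the span class matches the real page):

```python
from bs4 import BeautifulSoup

html = ('<div><span class="Text Intro Justify">agricultural products</span>'
        '<title>agricultural statutes</title></div>')
soup = BeautifulSoup(html, "html.parser")

# Searching every tag also picks up the <title>, which is not a
# statute paragraph at all.
everywhere = [t.name for t in soup.find_all(
    lambda tag: tag.string and "agricultural" in tag.string)]

# Restricting the search to the span class you care about avoids that.
spans = [t.text for t in soup.find_all("span", {"class": "Text Intro Justify"})
         if "agricultural" in t.text]
```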

If you have a lot of parsing to do like the above, where you are filtering by text, you may find lxml very powerful; using an XPath expression we can filter very easily:

base_url = "http://www.flsenate.gov/Laws/Statutes/2015/Chapter001/All"

from lxml.etree import fromstring, HTMLParser
import requests
r = requests.get(base_url).content
xml = fromstring(r, HTMLParser())

print(xml.xpath("//span[@class='Text Intro Justify' and contains(text(),'agricultural')]//text()"))

Which gives you:

['Crude turpentine gum (oleoresin), the product of a living tree or trees of the pine species, and gum-spirits-of-turpentine and gum resin as processed therefrom, shall be taken and understood to be agricultural products, farm products, and agricultural commodities.', u'Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.']

For a match on either "Agricultural" or "agricultural", the XPath needs translate() to map 'A' to 'a' (XPath 1.0 has no case-insensitive contains(); to ignore case entirely you would translate the whole alphabet):

xml.xpath("//span[@class='Text Intro Justify' and contains(translate(text(), 'A', 'a'), 'agricultural')]//text()")

The \u201c and \u201d you see are the repr output for the curly quotes “ and ”; when you actually print the strings you will see the str output.

In [3]: s = u"Whenever the terms \u201cagriculture,\u201d \u201cagricultural purposes,\u201d \u201cagricultural uses,\u201d or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture."

In [4]: print(s)
Whenever the terms “agriculture,” “agricultural purposes,” “agricultural uses,” or words of similar import are used in any of the statutes of the state, such terms include aquaculture, horticulture, and floriculture; aquacultural purposes, horticultural purposes, and floricultural purposes; aquacultural uses, horticultural uses, and floricultural uses; and words of similar import applicable to agriculture are likewise applicable to aquaculture, horticulture, and floriculture.
Padraic Cunningham
  • I will try this method too and get back to you. – CHballer Apr 03 '16 at 23:49
  • One thing I notice is that in your print result, there are u' and\u201, \u201d, mixed in with texts, I assume these are problems of encoding? I tried to to encode them by unf-8 but got a error as result. – CHballer Apr 03 '16 at 23:51
  • @TianMa, that is just `repr` output you are seeing, when you print the actual string you will see the correct output, I added it to the answer. – Padraic Cunningham Apr 03 '16 at 23:54
  • so i noticed that you said if I want to ignore case i would need to use `in t.text.lower()`, does that mean the word "agricultural" can be both Capitalized or in low cases? – CHballer Apr 05 '16 at 04:57
  • should I use regex, if i want to capture both "agricultural" and "Agricultural" ? how to make it case-insensitive. – CHballer Apr 05 '16 at 04:57
  • @TianMa, yes, calling lower will match any case `"agricultural"` and `"Agricultural" `, using the `re.I` will also match both – Padraic Cunningham Apr 05 '16 at 09:28