Extracting text from specific paragraphs of the website with Python 2

Question

I want to extract the paragraphs that list the industries reporting growth and contraction, along with the section on what respondents are saying. These paragraphs appear in several places on the page, usually just above a table. How can I use Requests, lxml, or BeautifulSoup to parse the page and select the paragraphs I need?

https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655

I tried lxml with XPath, but the page changes slightly each month when the new report is published, and my code stopped working.

Ettore Rizza · Answer 1 · 2017-02-26T19:45:19.730

Another option is PyQuery. It is fast and uses jQuery-style selectors, which you can find easily with the SelectorGadget extension for Chrome.

Then all that remains is to use it:

from pyquery import PyQuery as pq
import requests

url = "https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1"
content = requests.get(url).content
doc = pq(content)

respondent = doc(".formatted_content ul").text()

print(respondent)

Output:

“Demand very steady to start the year.” (Chemical Products) “January revenue target slightly lower following a big December shipment month.” (Computer & Electronic Products) “Strong start to the new year. Production is increasing and we are adding capacity.” (Plastics & Rubber Products) “Business looks stronger moving into the first quarter of 2017.” (Primary Metals) “Economic outlook remains stable and no current effects of geopolitical changes appear to be penetrating market conditions.” (Food, Beverage & Tobacco Products) “Sales bookings are exceeding expectations. We are starting to see supply shortages in hot rolled steel due to the curtailment of imports.” (Machinery) “Year starting on pace with Q4 2016.” (Transportation Equipment) “Business conditions are good, demand is generally increasing.” (Miscellaneous Manufacturing) “Conditions and outlook remain positive. Raw material prices are stable resulting in stable margins. Asset utilization remains high.” (Petroleum & Coal Products) “Steady demand from automotive.” (Fabricated Metal Products)

Is it possible to get the same result as a string instead of text? Suppose I have to extract a substring, etc.? — prashanth manohar, May 25 '21 at 18:39
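
On the comment above: PyQuery's `.text()` already returns an ordinary Python string, so the usual string methods and `re` apply to it directly. A minimal sketch, assuming Python 3 and using two entries copied from the output above in place of the live `respondent` value. The pattern and the names `pattern` and `pairs` are illustrative only, and the pattern assumes every entry keeps the “quote” (Industry) shape:

```python
import re

# Two entries copied from the output above; in practice this would be
# the `respondent` string returned by .text().
respondent = (
    "“Demand very steady to start the year.” (Chemical Products) "
    "“Steady demand from automotive.” (Fabricated Metal Products)"
)

# Each entry is assumed to look like: “quote” (Industry)
pattern = re.compile(r"“(?P<quote>[^”]+)”\s*\((?P<industry>[^)]+)\)")

# Collect (quote, industry) tuples as plain strings.
pairs = [(m.group("quote"), m.group("industry"))
         for m in pattern.finditer(respondent)]

for quote, industry in pairs:
    print("{}: {}".format(industry, quote))
```

If a future report nests parentheses or uses straight quotes, the pattern would need adjusting.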

score 1 · Accepted Answer · answered Feb 26 '17 at 17:19

How close is this code to what you have been using?

It uses one regex to identify the paragraphs and another to find the line that precedes the list of what respondents are saying. Then it simply displays the results.

>>> import requests
>>> URL = 'https://www.instituteforsupplymanagement.org/about/MediaRoom/newsreleasedetail.cfm?ItemNumber=30655&SSO=1'
>>> r = requests.get(URL)
>>> page = r.text
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(page, 'lxml')
>>> import re
>>> paras = soup.find_all('p', string=re.compile(r'(?:growth|contraction).*? are:'))
>>> saying = soup.find_all('strong', string=re.compile('WHAT RESPONDENTS ARE SAYING'))[0]
>>> for i, para in enumerate(paras):
...     'Paragraph ', i
...     para
...     
('Paragraph ', 0)
<p>Of the 18 manufacturing industries, 12 reported growth in January in the following order: Plastics &amp; Rubber Products; Miscellaneous Manufacturing; Apparel, Leather &amp; Allied Products; Paper Products; Chemical Products; Transportation Equipment; Food, Beverage &amp; Tobacco Products; Machinery; Petroleum &amp; Coal Products; Primary Metals; Fabricated Metal Products; and Computer &amp; Electronic Products. The five industries reporting contraction in January are: Nonmetallic Mineral Products; Wood Products; Furniture &amp; Related Products; Electrical Equipment, Appliances &amp; Components; and Printing &amp; Related Support Activities.</p>
('Paragraph ', 1)
<p>The 12 industries reporting growth in new orders in January — listed in order — are: Plastics &amp; Rubber Products; Apparel, Leather &amp; Allied Products; Miscellaneous Manufacturing; Chemical Products; Paper Products; Transportation Equipment; Electrical Equipment, Appliances &amp; Components; Petroleum &amp; Coal Products; Primary Metals; Machinery; Fabricated Metal Products; and Food, Beverage &amp; Tobacco Products. The five industries reporting a decrease in new orders during January are: Nonmetallic Mineral Products; Wood Products; Textile Mills; Computer &amp; Electronic Products; and Furniture &amp; Related Products.</p>
('Paragraph ', 2)
<p>The 10 industries reporting growth in production during the month of January — listed in order — are: Miscellaneous Manufacturing; Apparel, Leather &amp; Allied Products; Paper Products; Petroleum &amp; Coal Products; Plastics &amp; Rubber Products; Transportation Equipment; Chemical Products; Machinery; Food, Beverage &amp; Tobacco Products; and Computer &amp; Electronic Products. The five industries reporting a decrease in production during January are: Wood Products; Textile Mills; Nonmetallic Mineral Products; Electrical Equipment, Appliances &amp; Components; and Furniture &amp; Related Products.</p>
('Paragraph ', 3)
<p>Of the 18 manufacturing industries, the 10 reporting employment growth in January — listed in order — are: Textile Mills; Paper Products; Food, Beverage &amp; Tobacco Products; Machinery; Electrical Equipment, Appliances &amp; Components; Chemical Products; Miscellaneous Manufacturing; Transportation Equipment; Computer &amp; Electronic Products; and Nonmetallic Mineral Products. The five industries reporting a decrease in employment in January are: Plastics &amp; Rubber Products; Petroleum &amp; Coal Products; Primary Metals; Fabricated Metal Products; and Printing &amp; Related Support Activities. </p>
('Paragraph ', 4)
<p>The seven industries reporting growth in order backlogs in January — listed in order — are: Wood Products; Plastics &amp; Rubber Products; Electrical Equipment, Appliances &amp; Components; Primary Metals; Fabricated Metal Products; Miscellaneous Manufacturing; and Chemical Products. The seven industries reporting a decrease in order backlogs during January — listed in order — are: Nonmetallic Mineral Products; Textile Mills; Paper Products; Computer &amp; Electronic Products; Food, Beverage &amp; Tobacco Products; Transportation Equipment; and Furniture &amp; Related Products.</p>
('Paragraph ', 5)
<p>The eight industries reporting growth in new export orders in January — listed in order — are: Wood Products; Paper Products; Petroleum &amp; Coal Products; Chemical Products; Fabricated Metal Products; Transportation Equipment; Miscellaneous Manufacturing; and Food, Beverage &amp; Tobacco Products. The four industries reporting a decrease in new export orders during January are: Textile Mills; Nonmetallic Mineral Products; Plastics &amp; Rubber Products; and Machinery. Six industries reported no change in new export orders in January compared to December.</p>
('Paragraph ', 6)
<p>The four industries reporting growth in imports during the month of January are: Furniture &amp; Related Products; Apparel, Leather &amp; Allied Products; Fabricated Metal Products; and Food, Beverage &amp; Tobacco Products. The five industries reporting a decrease in imports during January are: Plastics &amp; Rubber Products; Primary Metals; Nonmetallic Mineral Products; Transportation Equipment; and Computer &amp; Electronic Products. Eight industries reported no change in imports in January compared to December.</p>
>>> saying.find_next_sibling()
<ul style="list-style-type: square;">
<li>“Demand very steady to start the year.” (Chemical Products)</li>
<li>“January revenue target slightly lower following a big December shipment month.” (Computer &amp; Electronic Products)</li>
<li>“Strong start to the new year. Production is increasing and we are adding capacity.” (Plastics &amp; Rubber Products)</li>
<li>“Business looks stronger moving into the first quarter of 2017.” (Primary Metals)</li>
<li>“Economic outlook remains stable and no current effects of geopolitical changes appear to be penetrating market conditions.” (Food, Beverage &amp; Tobacco Products)</li>
<li>“Sales bookings are exceeding expectations. We are starting to see supply shortages in hot rolled steel due to the curtailment of imports.” (Machinery)</li>
<li>“Year starting on pace with Q4 2016.” (Transportation Equipment)</li>
<li>“Business conditions are good, demand is generally increasing.” (Miscellaneous Manufacturing)</li>
<li>“Conditions and outlook remain positive. Raw material prices are stable resulting in stable margins. Asset utilization remains high.” (Petroleum &amp; Coal Products)</li>
<li>“Steady demand from automotive.” (Fabricated Metal Products)</li>
</ul>
>>>
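
Once a paragraph has been selected, its text can be split into the individual industry names. The following is a minimal sketch using only the standard library, assuming Python 3. The sample is the text of the last paragraph shown above (with `&amp;` already decoded, as `para.get_text()` would return it), and the helper name `industry_lists` is hypothetical. It relies on each list being introduced by "are:" and closed by a full stop, so the first paragraph, which uses "in the following order:", would need its own pattern:

```python
import re

# Text of the imports paragraph from the output above.
text = (
    "The four industries reporting growth in imports during the month of "
    "January are: Furniture & Related Products; Apparel, Leather & Allied "
    "Products; Fabricated Metal Products; and Food, Beverage & Tobacco "
    "Products. The five industries reporting a decrease in imports during "
    "January are: Plastics & Rubber Products; Primary Metals; Nonmetallic "
    "Mineral Products; Transportation Equipment; and Computer & Electronic "
    "Products. Eight industries reported no change in imports in January "
    "compared to December."
)


def industry_lists(paragraph):
    """Return one list of industry names per 'are: A; B; and C.' run."""
    lists = []
    # Capture everything between 'are:' and the next full stop.
    for run in re.findall(r"are:\s*([^.]+)\.", paragraph):
        # Items are separated by ';'; the last one starts with 'and '.
        names = [re.sub(r"^and\s+", "", part.strip())
                 for part in run.split(";")]
        lists.append([name for name in names if name])
    return lists


growth, decrease = industry_lists(text)
print(growth)
print(decrease)
```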

Is it a good idea to use regex to parse and extract content from websites that are updated periodically? I have read several posts that strongly criticize using regex for this purpose: http://stackoverflow.com/a/1732454/4399016 I used lxml, XPath and urllib2. It broke the very next month, when the latest report was published. Anyway, thanks for the effort. — prashanth manohar, Feb 26 '17 at 17:37
We can only answer the questions that people like you ask. You implied that 'growth' and 'contraction' are key words in these pages. I think you'll find that the issue is not the use of regex but regularity. — Bill Bell, Feb 26 '17 at 17:41
I see what you mean. My need is to capture the different paragraphs on the page, the content of which changes periodically. I actually waited one month for the new report just to see whether the code is robust. — prashanth manohar, Feb 26 '17 at 17:49
Regexes can't parse HTML: that's perfectly true. But there is a difference between parsing HTML and extracting from a text the paragraphs that contain certain words. For the latter, regexes are an excellent tool. The problem, as Bill said, is regularity. If the paragraphs you want always contain certain precise words, then regexes are appropriate. But if they are always the first HTML list just above the main table, then CSS or XPath selectors will be more robust. Several sample reports would be necessary to identify patterns and then choose the best method. — Ettore Rizza, Feb 26 '17 at 20:34
A last point about XPath: Chrome's developer tools will tell you that the XPath of the list you want is `//*[@id="home_feature_container"]/div/div[2]/div/ul`. That is its current position, but it may not be next month, which is probably why your scraper worked only once. — Ettore Rizza, Feb 26 '17 at 21:01
This is why a text-based/regex search on the `strong` tag is more relevant. I'm not even sure it's essential to specify what the `strong` tag should contain. It should also work to simply look for a `ul` list that follows a `strong` sibling: `//strong/following-sibling::ul` (or `//strong[contains(., "WHAT RESPONDENTS ARE SAYING")]/following-sibling::ul` if you need to be more specific). — Ettore Rizza, Feb 27 '17 at 08:30
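
The sibling-based XPath from the comments can be sketched with lxml as follows, assuming Python 3. The `SAMPLE` markup is a made-up, heavily trimmed stand-in for the report page (the first heading and its list are hypothetical); only the two quotes are taken from the output above. The sketch filters with `contains()` because a bare string literal inside an XPath 1.0 predicate is always true and therefore filters nothing, which the second query illustrates:

```python
from lxml import html

# Hypothetical, trimmed stand-in for the real page structure.
SAMPLE = """
<div>
  <strong>HYPOTHETICAL OTHER HEADING</strong>
  <ul><li>unrelated item</li></ul>
  <strong>WHAT RESPONDENTS ARE SAYING</strong>
  <ul>
    <li>“Demand very steady to start the year.” (Chemical Products)</li>
    <li>“Steady demand from automotive.” (Fabricated Metal Products)</li>
  </ul>
</div>
"""

tree = html.fromstring(SAMPLE)

# Select the first <ul> that follows the <strong> containing the heading text.
items = tree.xpath(
    '//strong[contains(., "WHAT RESPONDENTS ARE SAYING")]'
    '/following-sibling::ul[1]/li'
)
quotes = [li.text_content().strip() for li in items]
print(quotes)

# A string literal as predicate is always true, so this matches every
# <ul> that follows any <strong> sibling.
unfiltered = tree.xpath(
    '//strong["WHAT RESPONDENTS ARE SAYING"]/following-sibling::ul'
)
print(len(unfiltered))
```

Anchoring on the heading text rather than on an absolute position is what makes this kind of selector more likely to survive small monthly layout changes, provided the heading wording itself stays the same.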
