Only scrape a portion of the page

Question

I am using Python/requests to gather data from a website. Ideally I only want the latest 'banking' information, which is always at the top of the page.

The code I have currently does that, but then it attempts to keep going and hits an index out of range error. I am not very good with aspx pages, but is it possible to only gather the data under the 'banking' heading?

Here's what I have so far:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

print('Scraping South Dakota Banking Activity Actions...')

url2 = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'
r2 = requests.get(url2, headers=headers)


soup = BeautifulSoup(r2.text, 'html.parser')

mylist5 = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text)

Ideally I'd be able to slice the information as well so I can only show the activity or approval status, etc.

Pretty much the entire page is under that heading. By latest do you mean just _August 2019_ ? What should output look like? — QHarr, Sep 13 '19 at 16:51
@Qharr Ideally I would like the latest month and just the banking activity, if possible. — bobby_pine, Sep 30 '19 at 17:40

score 1 · Answer 1 · answered Sep 13 '19 at 17:45

With bs4 4.7.1+ you can use :contains to isolate the latest month by filtering out the rows of the months listed further down the page. I explain the principle of filtering out later general siblings using :not in this SO answer. In short: select the row containing "August 2019" (the month is determined dynamically) together with all its following sibling rows, then select the row containing "July 2019" together with all its following sibling rows, and remove the latter set from the former.

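import re

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs

r = requests.get('https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx')
soup = bs(r.content, 'lxml')

# Cells with colspan="2" that contain a link; every other match is taken as a month label
months = [i.text for i in soup.select('[colspan="2"]:has(a)')][0::2]
latest_month = months[0]
next_month = months[1]  # the next label down the page

# Latest month's row plus its following sibling rows,
# minus the next month's row and everything after it.
# Newer soupsieve releases prefer the spelling :-soup-contains over :contains.
rows_of_interest = soup.select(f'tr:contains("{latest_month}"), tr:contains("{latest_month}") ~ tr:not(:contains("{next_month}"), :contains("{next_month}") ~ tr)')
results = []

for row in rows_of_interest:
    # Replace non-breaking spaces and runs of whitespace with a single space
    data = [re.sub(r'\xa0|\s{2,}', ' ', td.text) for td in row.select('td')]
    if len(data) == 1:
        data.extend([''])  # pad single-cell rows so every row has two columns
    results.append(data)
df = pd.DataFrame(results)
print(df)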
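
To see the sibling-filter principle in isolation, here is a minimal offline sketch. The HTML snippet, bank names, and statuses are made up for illustration and are not taken from the real page. It also assumes a soupsieve release recent enough (2.1+) to accept :-soup-contains, the newer spelling of :contains; on older installs the :contains form above should behave the same.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature of the report table: month header rows followed by entries.
html = """
<table>
  <tr><td colspan="2">August 2019</td></tr>
  <tr><td>Bank A</td><td>Approved</td></tr>
  <tr><td>Bank B</td><td>Pending</td></tr>
  <tr><td colspan="2">July 2019</td></tr>
  <tr><td>Bank C</td><td>Approved</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
latest, previous = "August 2019", "July 2019"

# The latest month's row, plus its following sibling rows that are neither
# the previous month's row nor a row that follows it.
selector = (
    f'tr:-soup-contains("{latest}"), '
    f'tr:-soup-contains("{latest}") ~ tr:not('
    f':-soup-contains("{previous}"), :-soup-contains("{previous}") ~ tr)'
)
rows = soup.select(selector)
demo_rows = [[td.get_text(strip=True) for td in row.select("td")] for row in rows]
print(demo_rows)
```

Only the August header and the two rows beneath it should be selected; the July header and everything after it are excluded by the :not(...) part.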

No, I got this error: NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type. — bobby_pine, Oct 02 '19 at 10:45
Are you using bs4 4.7.1+? You should upgrade your bs4 to at least this version anyway, for the added functionality and improved code base. — QHarr, Oct 02 '19 at 10:46

score 0 · Answer 2 · answered Sep 13 '19 at 17:09

Same setup as before:

import requests
from bs4 import BeautifulSoup, Tag

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'

print('Scraping South Dakota Banking Activity Actions...')
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

Inspecting the page source, we can find the id of the element you need (the one holding the table of values).

banking = soup.find(id='secondarycontent')

Next, we filter out the children of the table that aren't tags (NavigableString and the like). The second line shows how to get the text of each remaining tag (for other options, check the Tag documentation).

blocks = [b for b in banking.table.contents if isinstance(b, Tag)]  # keep tags only, dropping NavigableString nodes
texts = [b.text for b in blocks]
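
If you want individual cells (for example just the activity or the status) rather than the whole row text, a sketch along these lines may help. The HTML here is a made-up two-row stand-in for the real table. It also illustrates a likely cause of the IndexError in the question: a row with a single cell has no tds[1], so short rows are padded before indexing.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical stand-in for the real table: one header row, one data row.
html = """<table>
<tr><td colspan="2">August 2019</td></tr>
<tr><td>Bank A</td><td>Approved</td></tr>
</table>"""

table = BeautifulSoup(html, "html.parser").table

# Keep only Tag children; the newlines between rows are NavigableString nodes.
row_tags = [child for child in table.contents if isinstance(child, Tag)]

pairs = []
for row in row_tags:
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    cells += [""] * (2 - len(cells))  # pad rows that have fewer than two cells
    pairs.append((cells[0], cells[1]))

print(pairs)
```

Each tuple then holds the first and second cell of a row, so you can pick whichever column you need.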

Now, if by "latest" you mean only the most recent month, we must determine which month is the latest and which is the one before it.

current_month_idx, last_month_idx = None, None
current_month, last_month = 'August 2019', 'July 2019'  # can parse with datetime too
for i, b in enumerate(blocks):
    if current_month in b.text:
        current_month_idx = i
    elif last_month in b.text:
        last_month_idx = i

    if all(idx is not None for idx in (current_month_idx, last_month_idx)):
        break  # stop once both indices have been found

# Both month headers must have been found, with the current month listed first
assert current_month_idx is not None and last_month_idx is not None
assert current_month_idx < last_month_idx

# Blocks strictly between the two month headers belong to the current month
curr_month_blocks = blocks[current_month_idx + 1:last_month_idx]
curr_month_texts = [b.text for b in curr_month_blocks]
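
Instead of hardcoding the two month names, they could be derived with datetime, as hinted in the comment above. This is only a sketch: latest_two_months is a hypothetical helper, the input list is made up, and it assumes the labels look like "August 2019" and that the locale uses English month names.

```python
from datetime import datetime


def latest_two_months(labels):
    """Return the newest month label and the one before it, judged by date."""
    parsed = sorted(
        ((datetime.strptime(label.strip(), "%B %Y"), label.strip()) for label in labels),
        reverse=True,  # newest first
    )
    return parsed[0][1], parsed[1][1]


# Hypothetical labels, as they might be collected from the month header rows.
current_month, last_month = latest_two_months(["July 2019", "August 2019", "June 2019"])
print(current_month, last_month)
```

The two returned strings can then be used in place of the hardcoded 'August 2019' and 'July 2019'.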

Is there any way to show just the contents of the 'banking' heading? @crissal — bobby_pine, Sep 30 '19 at 17:40
