0

I am using Python/requests to gather data from a website. Ideally I only want the latest 'banking' information, which always at the top of the page.

The code I have currently does that, but then it attempts to keep going and hits an index out of range error. I am not very good with aspx pages, but is it possible to only gather the data under the 'banking' heading?

Here's what I have so far:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}

print('Scraping South Dakota Banking Activity Actions...')

url2 = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'
r2 = requests.get(url2, headers=headers)


soup = BeautifulSoup(r2.text, 'html.parser')

mylist5 = []
for tr in soup.find_all('tr')[2:]:
    tds = tr.find_all('td')
    print(tds[0].text, tds[1].text)

Ideally I'd be able to slice the information as well so I can only show the activity or approval status, etc.

bobby_pine
  • 41
  • 6

2 Answers2

1

With bs4 4.7.1 + you can use :contains to isolate the latest month by filtering out the later months. I explain the principle of filtering out later general siblings using :not in this SO answer. In short, find the row containing "August 2019" (this month is determined dynamically) and grab it and all its siblings, then find the row containing "July 2019" and all its general siblings and remove the latter from the former.

import requests, re
from bs4 import BeautifulSoup as bs
import pandas as pd

r = requests.get('https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx')
soup = bs(r.content, 'lxml')
months = [i.text for i in soup.select('[colspan="2"]:has(a)')][0::2]
latest_month = months[0]
next_month = months[1]
rows_of_interest = soup.select(f'tr:contains("{latest_month}"), tr:contains("{latest_month}") ~ tr:not(:contains("{next_month}"), :contains("{next_month}") ~ tr)')
results = []

for row in rows_of_interest:
    data = [re.sub('\xa0|\s{2,}',' ',td.text) for td in row.select('td')]
    if len(data) == 1:
        data.extend([''])
    results.append(data)
df = pd.DataFrame(results)
print(df)
QHarr
  • 83,427
  • 12
  • 54
  • 101
  • no I got this error NotImplementedError: Only the following pseudo-classes are implemented: nth-of-type. – bobby_pine Oct 02 '19 at 10:45
  • are you using bs4 4.7.1 + ? You should upgrade your bs4 to at least this anyway for the added functionality and improved code base – QHarr Oct 02 '19 at 10:46
0

Same as before

import requests
from bs4 import BeautifulSoup, Tag

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.94 Safari/537.36'}
url = 'https://dlr.sd.gov/banking/monthly_activity_reports/monthly_activity_reports.aspx'

print('Scraping South Dakota Banking Activity Actions...')
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

Inspecting data source, we can find the id of the element you need (the table of values).

banking = soup.find(id='secondarycontent')

After this, we filter out elements of soup that aren't tags (like NavigableString or others). You can see how to get texts too (for other options, check Tag doc).

blocks = [b for b in banking.table.contents if type(b) is Tag]  # filter out NavigableString
texts = [b.text for b in blocks]

Now, if it's the goal you're achieving when you talk about latest, we must determine which month is latest and which is the month before.

current_month_idx, last_month_idx = None, None
current_month, last_month = 'August 2019', 'July 2019'  # can parse with datetime too
for i, b in enumerate(blocks):
    if current_month in b.text:
        current_month_idx = i
    elif last_month in b.text:
        last_month_idx = i

    if all(idx is not None for idx in (current_month_idx, last_month_idx)):
        break  # break when both indeces are not null

assert current_month_idx < last_month_idx

curr_month_blocks = [b for i, b in enumerate(blocks) if current_month_idx < i < last_month_idx]
curr_month_texts = [b.text for b in curr_month_blocks]
crissal
  • 2,547
  • 7
  • 25