
I am very new to Python (three days in) and I have stumbled into a problem I can't solve with Google/YouTube. I want to scrape the National Governors Association website for background data on all US governors and save it to a CSV file.

I have managed to scrape a list of all governors, but to get more details I need to enter each governor's page individually and save the data. I have found code suggestions online which utilise a "next" button or the URL structure to loop over several sites. This website, however, does not have a next button, and the URL links do not follow a loopable structure. So I am stuck.

I would very much appreciate any help I can get. I want to extract the info above the main text (Office Dates, School(s) etc. in the "address" tag) on each governor's page, for example this one.

This is what I have got so far:

import bs4 as bs
import urllib.request
import pandas as pd

url = 'https://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0=10&endcac77e09-db17-41cb-9de0-687b843338d0=9999&pagesizecac77e09-db17-41cb-9de0-687b843338d0=10&militaryService=&higherOfficesServed=&religion=&lastName=&sex=Any&honors=&submit=Search&college=&firstName=&party=&inOffice=Any&biography=&warsServed=&'

sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, "html.parser")

#dl list of all govs
dfs = pd.read_html(url, header=0)
for df in dfs:
    df.to_csv('governors.csv')

#dl links to each gov
table = soup.find('table', 'table table-striped table-striped')
links = table.findAll('a')
with open('governors_links.csv', 'w') as r:
    for link in links:
        r.write(link['href'])
        r.write('\n')

#enter each gov page and extract data in the "address" tag(s)
#save this in a csv file
  • What do you mean "the url-links does not follow a loopable structure"? You're extracting href URLs -- you just need to iterate over the URLs and use BeautifulSoup to scrape the structured data you need from each one. – cmaher Jan 10 '18 at 17:54
  • Try this url. It will let you fetch all the data. I just kicked out the portion for next page from the url. Give it a try: `https://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0&endcac77e09-db17-41cb-9de0-687b843338d0=319&pagesizecac77e09-db17-41cb-9de0-687b843338d0=10&college=&lastName=&submit=Search&inOffice=Any&sex=Any&militaryService=&biography=&warsServed=&higherOfficesServed=&honors=&religion=&firstName=&party=&` – SIM Jan 10 '18 at 20:05
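
If that modified URL really does return the full table in one response (I haven't verified this, so treat it as an assumption), the whole list could be pulled with a single `pandas.read_html` call, for example:

import pandas as pd

# URL suggested in the comment above; whether it returns every governor at once is untested
url = ('https://www.nga.org/cms/FormerGovBios?begincac77e09-db17-41cb-9de0-687b843338d0'
       '&endcac77e09-db17-41cb-9de0-687b843338d0=319'
       '&pagesizecac77e09-db17-41cb-9de0-687b843338d0=10'
       '&college=&lastName=&submit=Search&inOffice=Any&sex=Any&militaryService='
       '&biography=&warsServed=&higherOfficesServed=&honors=&religion=&firstName=&party=&')

tables = pd.read_html(url, header=0)        # returns a list of DataFrames, one per <table> on the page
tables[0].to_csv('governors.csv', index=False)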

2 Answers


I'm assuming that you've got all the links in a list named links.
You can do this to get the data you want of all the Governors one by one:

for link in links:
    r = urllib.request.urlopen(link).read()
    soup = bs.BeautifulSoup(r, 'html.parser')
    print(soup.find('h2').text)  # Name of Governor
    for p in soup.find('div', {'class': 'col-md-3'}).findAll('p'):
        print(p.text.strip())  # Office dates, address, phone, ...
    for p in soup.find('div', {'class': 'col-md-7'}).findAll('p'):
        print(p.text.strip())  # Family, school, birth state, ...

Edit:

Change your links list to

links = ['https://www.nga.org' + x.get('href') for x in table.findAll('a')]
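
If you want to save those fields instead of printing them, one way is Python's csv module. A minimal sketch, reusing the same selectors as above (the file name and column names are just an example, not taken from the question):

import csv
import urllib.request
import bs4 as bs

links = []  # the absolute URLs built as shown above

with open('governors_info.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['Name', 'Details', 'Background'])  # hypothetical header
    for link in links:
        r = urllib.request.urlopen(link).read()
        soup = bs.BeautifulSoup(r, 'html.parser')
        name = soup.find('h2').text.strip()
        # join the <p> texts of each column into one cell so every governor is a single row
        details = ' | '.join(p.text.strip() for p in soup.find('div', {'class': 'col-md-3'}).findAll('p'))
        background = ' | '.join(p.text.strip() for p in soup.find('div', {'class': 'col-md-7'}).findAll('p'))
        writer.writerow([name, details, background])

csv.writer takes care of quoting, so commas inside the scraped text won't break the columns.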
Keyur Potdar
  • Thanks, this is on the right track! This works in the sense that it prints the stuff in the "address" tag(s). When I try to store it, however, I get errors. – klarinosaurus Jan 10 '18 at 21:12
  • This code stores, but inadequately (repetitions and ugly): with open('governors_info.csv', 'w') as csvfile: for link in links: r = urllib.request.urlopen(link).read() soup = bs.BeautifulSoup(r, 'html.parser') csvfile.write(soup.find('h2').text) # Name of Gov for p in soup.find('div', {'class': 'col-md-3'}).findAll('p'): csvfile.write(p.text.strip()) # Office dates, address, ... for p in soup.find('div', {'class': 'col-md-7'}).findAll('p'): csvfile.write(p.text.strip()) # Family, school etc. csvfile.close() – klarinosaurus Jan 11 '18 at 15:15
  • It's ugly because of the formatting of the website. As you can see by inspecting the page, there are many whitespaces inside the `<p> ... </p>` tags. How to make it pretty is for another question. I think it's already answered somewhere. – Keyur Potdar Jan 12 '18 at 04:27

This may work. I haven't tested it to completion since I'm at work, but it should be a starting point for you.

import bs4 as bs
import requests
import re
def is_number(s):
    try:
        int(s)
        return True
    except ValueError:
        return False

def main():
    url = 'https://www.nga.org/cms/FormerGovBios?inOffice=Any&state=Any&party=&lastName=&firstName=&nbrterms=Any&biography=&sex=Any&religion=&race=Any&college=&higherOfficesServed=&militaryService=&warsServed=&honors=&birthState=Any&submit=Search'

    sauce = requests.get(url).text
    soup = bs.BeautifulSoup(sauce, "html.parser")
    finished = False
    csv_data = open('Govs.csv', 'a')
    csv_data.write('Name,Address,OfficeDates,Success,Address,Phone,Fax,Born,BirthState,Party,Schooling,Email\n')  # newline so the header is its own row
    try:
        while not finished:
        #dl links to each gov
            table = soup.find('table', 'table table-striped table-striped')
            links = table.findAll('a')
            for link in links:
                info_array = []
                gov = dict.fromkeys(['Name', 'OfficeDates', 'Success', 'Address', 'Phone', 'Fax',
                                     'Born', 'BirthState', 'Party', 'Schooling', 'Email'], '')  # defaults so missing fields don't raise KeyError later
                name = link.string
                gov_sauce =  requests.get(r'https://nga.org'+link.get('href')).text
                gov_soup = bs.BeautifulSoup(gov_sauce, "html.parser")
                #print(gov_soup)
                office_and_stuff_info = gov_soup.findAll('address')
                for address in office_and_stuff_info:
                    infos = address.findAll('p')
                    for info in infos:
                        tex = re.sub(r'[^a-zA-Z\d:]', '', info.text)  # strip everything except letters, digits and ':' (note: also removes spaces)
                        tex = re.sub(r'\s+', ' ', tex)  # was re-running on info.text, discarding the line above
                        tex = tex.strip()
                        if tex: 
                            info_array.append(tex)
                info_array = list(set(info_array))
                gov['Name'] = name
                secondary_address = ''
                gov['Address'] = ''
                for line in info_array:
                    if 'OfficeDates:' in line:
                        gov['OfficeDates'] = line.replace('OfficeDates:','').replace('-','')
                    elif 'Succ' in line or 'Fail' in line:  # "'x' or 'y' in line" is always truthy; test each substring
                        gov['Success'] = line
                    elif 'Address' in line:
                        gov['Address'] = line.replace('Address:','')
                    elif 'Phone:' in line or 'Phone ' in line:
                        gov['Phone'] = line.replace('Phone ','').replace('Phone: ','')
                    elif 'Fax:' in line:
                        gov['Fax'] = line.replace('Fax:','')
                    elif 'Born:' in line:
                        gov['Born'] = line.replace('Born:','')
                    elif 'BirthState:' in line:  # spaces were stripped above, so the key has no space
                        gov['BirthState'] = line.replace('BirthState:','')
                    elif 'Party:' in line:
                        gov['Party'] =  line.replace('Party:','')
                    elif 'Schools:' in line:  # '(' and ')' were stripped above, so 'School(s):' becomes 'Schools:'
                        gov['Schooling'] = line.replace('Schools:','')
                    elif 'Email:' in line:
                        gov['Email'] = line.replace('Email:','')
                    else:
                        secondary_address = line
                gov['Address'] = gov['Address'] + secondary_address
                data_line = gov['Name'] +','+gov['Address'] +','+gov['OfficeDates'] +','+gov['Success'] +','+gov['Address'] +','+ gov['Phone'] +','+ gov['Fax'] +','+gov['Born'] +','+gov['BirthState'] +','+gov['Party'] +','+gov['Schooling'] +','+gov['Email']
                csv_data.write(data_line + '\n')  # newline so each governor gets its own row
            next_page_link = soup.find('ul','pagination center-blockdefault').find('a',{'aria-label':'Next'})
            if next_page_link is None or 'disabled' in (next_page_link.parent.get('class') or []):  # get('class') returns a list, not a string
                finished = True
            else:

                url = r'https://nga.org'+next_page_link.get('href')
                sauce = requests.get(url).text
                soup = bs.BeautifulSoup(sauce,'html.parser')
    except Exception as e:
        print('Code failed:', e)  # show the actual error instead of swallowing it
    finally:
        csv_data.close()
if __name__ == '__main__':
    main()
Emmanuel Ferran
  • Thanks for the effort. After running for 30+ min I get "ConnectionError: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))" I am guessing that the nga.gov site is too slow for this approach? – klarinosaurus Jan 10 '18 at 21:17
    Either that or it's proceeding too quickly and the server shuts off the connection to "defend" itself. – Emmanuel Ferran Jan 10 '18 at 21:22
  • I tried to learn from https://stackoverflow.com/questions/33895739/python-requests-cant-load-any-url-remote-end-closed-connection-without-respo and use request session, but after running for 2.5 h I got the following error: "links = table.findAll('a') AttributeError: 'NoneType' object has no attribute 'findAll'", which does not make any sense to me. :) – klarinosaurus Jan 11 '18 at 15:21
  • I updated my code to append the file for each governor instead of waiting to write it all. That way even if it crashes you will still have some data. Give it a shot with the sessions fix you found and see how far that can go. – Emmanuel Ferran Jan 11 '18 at 17:41
  • Thanks, but it only writes the header row in the csv file and then it crashes before it writes the info from the urls. I don't know where the problem is since it is coded to write "Code failed." when it doesn't work. – klarinosaurus Jan 12 '18 at 13:53
  • I agree that some websites will cut the connection on purpose; sometimes adding `time.sleep(5)` in the code, so it waits for 5 seconds at a certain point, will help. Some websites also have bots to detect web-scraping code. I'm not sure whether BeautifulSoup would be detected, but sometimes it didn't work for me. Python `Selenium` has never disappointed me so far, even when there's a bot. – Cherry Wu Jan 22 '18 at 01:19
  • Also, personally, I wouldn't suggest writing data into the CSV while doing the scraping, especially when you are not sure whether your code works all the time; it caused me trouble when I did that. How about saving your data in a dictionary inside `try` and writing it to the CSV in `finally`? (A rough sketch of this pattern is below.) – Cherry Wu Jan 22 '18 at 01:20
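
Building on the last two comments, here is a minimal, untested sketch of that pattern: throttle the requests with `time.sleep`, reuse a `requests.Session`, keep everything in a list while scraping, and only write the CSV in `finally`. The selectors (`h2`, `address`) are taken from the answers above; the column names, file name, and empty `links` list are hypothetical placeholders.

import csv
import time

import bs4 as bs
import requests

links = []                    # fill with the absolute governor-page URLs collected earlier
FIELDS = ['Name', 'Details']  # hypothetical column names
session = requests.Session()  # reuse one connection instead of opening a new one per request
rows = []                     # collect everything in memory first

try:
    for link in links:
        resp = session.get(link, timeout=30)
        soup = bs.BeautifulSoup(resp.text, 'html.parser')
        name_tag = soup.find('h2')
        details = [a.get_text(' ', strip=True) for a in soup.findAll('address')]
        rows.append({
            'Name': name_tag.get_text(strip=True) if name_tag else '',
            'Details': ' | '.join(details),
        })
        time.sleep(5)         # be polite; give the server a break between requests
except Exception as e:
    print('Scraping stopped early:', e)
finally:
    # Write whatever was collected, even if the loop crashed halfway through.
    with open('governors_info.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)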