
I am trying to download all the rosters of the 2011 NIH study sections from the Wayback Machine. For this, I need to go to this link (https://web.archive.org/web/20111027104153/http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx), get all the links associated with the different study sections (first column), follow each of these links, get the link associated with "View Roster", and then scrape the names. I was able to get the link of every study section, but I am having trouble getting the "View Roster" link for each of them (the code crashes).

So far, in order to get the study section links, I have the following code:

## GLOBAL SET-UP
from selenium import webdriver
import requests
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='PATH', options=options)

STARTING_URL = "https://web.archive.org/web/20111027104153/http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx"
headers = {'User-Agent': 'Safari'}
starting_page = requests.get(STARTING_URL, headers=headers)
starting_soup = BeautifulSoup(starting_page.content, 'html.parser')

# table listing the standing study sections
table = starting_soup.find('table', summary="This table contains information on CSR Meeting Roster")

overall_dict = {}      # study section name -> study section link
overall_dict_n_n = {}  # roster info for each study section

counter = 1
counter_n_n = 1

#####################################    
#STEP 1: Getting study sections link#
#####################################
for i, row in enumerate(table.find_all('tr')):
    if i == 0:
        header = [el.text.strip() for el in row.find_all('th')]
    else:
        href = row.find("a").get("href")  # hyperlink of the study section
        name = [el.text.strip() for el in row.find_all('td')]  # name of the study section
        overall_dict[counter] = [name, href]

        counter += 1

In order to go to the next step, I wanted to iterate through each link in the dictionary, make a request, and then get the "View Roster" link with the following code:

######################################
#STEP 2: Getting the View Roster link#
######################################
for identifier, info_list in overall_dict.items():
    print(info_list[1])

    individuals_page = requests.get(info_list[1], headers=headers)
    section_soup = BeautifulSoup(individuals_page.content, 'lxml')

    column = section_soup.find("table", {"id": "Table1"})
    people_in_column = column.find_all("tr")[1]

    href_n_n = people_in_column.find_all("td")[3].find("a").get("href")  # "View Roster" hyperlink
    name_n_n = people_in_column.find_all("td")[0].find("font").contents[0]  # study section name

    overall_dict_n_n[counter_n_n] = [name_n_n, href_n_n, info_list[0]]
    counter_n_n += 1  # increment the counter for the next study section

But my loop crashes at the requests.get line: I get an error of the form "HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20111027104153/http://internet.csr.nih.gov/Roster_proto1/sectionI_list_detail.asp?NEWSRG=ACE&SRG=ACE&SRGDISPLAY=ACE (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcd72ce0c40>: Failed to establish a new connection: [Errno 60] Operation timed out'))"

I am not sure how to solve this problem, any help would be appreciated.

Clara HL

1 Answer


It seems like you are requesting the web.archive.org public URL too many times in your loop, which is causing the time-out. Since you are already using the requests package, check out its retry utility. With it you can automatically retry a request (in case of a time-out) and add delays between retries via backoff_factor to avoid your connection being refused.
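For example, here is a minimal sketch of that idea using urllib3's Retry mounted on a requests Session (the retry count, backoff_factor, and timeout are arbitrary values to tune; info_list[1] and headers are the variables from your Step 2 loop):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# one session reused for all Wayback Machine requests
session = requests.Session()
retries = Retry(
    total=5,                 # give up after 5 attempts
    backoff_factor=1,        # sleep 1s, 2s, 4s, ... between attempts
    status_forcelist=[429, 500, 502, 503, 504],  # also retry on these HTTP codes
)
session.mount("https://", HTTPAdapter(max_retries=retries))
session.mount("http://", HTTPAdapter(max_retries=retries))

# then, inside the Step 2 loop, replace requests.get with:
individuals_page = session.get(info_list[1], headers=headers, timeout=30)

Adding a short time.sleep between iterations of the loop can also help, since web.archive.org tends to throttle rapid repeated requests from the same client.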

Here's another SO question/answer that relates to your question.

Kakedis