I am trying to download all the rosters of the 2011 NIH study sections from the Wayback Machine. To do this, I need to open this page (https://web.archive.org/web/20111027104153/http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx), collect the links for the individual study sections (first column), follow each of those links, find the link labeled "View Roster", and then scrape the names from each roster. I was able to collect the links of every study section, but I am having trouble fetching the "View Roster" link for each of them (the code crashes).
So far, in order to get the study section links, I have the following code:
## GLOBAL SET-UP
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(executable_path='PATH', options=options)

STARTING_URL = "https://web.archive.org/web/20111027104153/http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx"
headers = {'User-Agent': 'Safari'}
starting_page = requests.get(STARTING_URL, headers=headers)
starting_soup = BeautifulSoup(starting_page.content, 'html.parser')
table = starting_soup.find('table', summary="This table contains information on CSR Meeting Roster")

overall_dict = {}      # counter -> [name, href] for each study section
overall_dict_n_n = {}  # counter -> "View Roster" results
counter = 1
counter_n_n = 1
#####################################
#STEP 1: Getting study section links#
#####################################
for i, row in enumerate(table.find_all('tr')):
    if i == 0:
        header = [el.text.strip() for el in row.find_all('th')]  # column headers
    else:
        href = row.find("a").get("href")  # hyperlink for the study section
        name = [el.text.strip() for el in row.find_all('td')]  # cell text for the study section
        overall_dict[counter] = [name, href]
        counter += 1
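As a side note, since hrefs scraped from archived pages can come back root-relative (e.g. starting with /web/...), it may be safer to normalize each href against the page URL before storing it. A small sketch using only the standard library (the helper name is mine):

```python
from urllib.parse import urljoin

STARTING_URL = "https://web.archive.org/web/20111027104153/http://public.csr.nih.gov/StudySections/Standing/Pages/default.aspx"

def absolutize(href):
    """Resolve a possibly root-relative href against the Wayback page URL."""
    return urljoin(STARTING_URL, href)

# Root-relative links gain the web.archive.org host;
# already-absolute links pass through unchanged.
```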
For the next step, I wanted to iterate through each link in the dictionary, make a request, and then extract the "View Roster" link with the following code:
#####################################
#STEP 2: Getting "View Roster" links#
#####################################
for identifier, info_list in overall_dict.items():
    print(info_list[1])
    individuals_page = requests.get(info_list[1], headers=headers)
    starting_soup = BeautifulSoup(individuals_page.content, 'lxml')
    column = starting_soup.find("table", {"id": "Table1"})
    people_in_column = column.find_all("tr")[1]
    href_n_n = people_in_column.find_all("td")[3].find("a").get("href")  # the "View Roster" hyperlink
    name_n_n = people_in_column.find_all("td")[0].find("font").contents[0]  # the study section name
    overall_dict_n_n[counter_n_n] = [name_n_n, href_n_n, info_list[0]]
    counter_n_n += 1  # increment the counter for the next study section
But my loop crashes at the requests.get line: I get an error of the form "HTTPSConnectionPool(host='web.archive.org', port=443): Max retries exceeded with url: /web/20111027104153/http://internet.csr.nih.gov/Roster_proto1/sectionI_list_detail.asp?NEWSRG=ACE&SRG=ACE&SRGDISPLAY=ACE (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fcd72ce0c40>: Failed to establish a new connection: [Errno 60] Operation timed out'))"
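Since the URL in the error message looks well-formed, I suspect this is Wayback Machine throttling or a transient network failure rather than a parsing problem. One thing I am considering is adding an explicit timeout plus retries with backoff, along the lines of this sketch (the specific Retry settings are placeholders I made up):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session():
    """Build a requests session that retries transient failures with backoff."""
    retry = Retry(
        total=5,                                     # give up after 5 attempts
        backoff_factor=2,                            # exponential back-off between attempts
        status_forcelist=[429, 500, 502, 503, 504],  # also retry these HTTP statuses
    )
    session = requests.Session()
    session.mount("https://", HTTPAdapter(max_retries=retry))
    return session

# intended usage inside the loop (not run here):
# individuals_page = make_session().get(info_list[1], headers=headers, timeout=30)
```

Would this be the right approach, or is something else going on?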
I am not sure how to solve this problem, any help would be appreciated.