I need to scrape the following table: https://haexpeditions.com/advice/list-of-mount-everest-climbers/
How to do it with python?
The site uses this API to fetch the table data, so you could request it from there.
(I used cloudscraper because it's easier than figuring out the exact set of request headers needed to avoid a 406 error response; and the try...except...print approach (instead of just doing tableData = [dict(...) for row in api_req.json()] directly) helps you see what went wrong in case of error, without raising anything that would halt the program.)
import cloudscraper

api_url = 'https://haexpeditions.com/wp-admin/admin-ajax.php?action=wp_ajax_ninja_tables_public_action&table_id=1084&target_action=get-all-data&default_sorting=old_first&ninja_table_public_nonce=2491a56a39&chunk_number=0'
api_req = cloudscraper.create_scraper().get(api_url)
try:
    jData, jdMsg = api_req.json(), f'- {len(api_req.json())} rows from'
except Exception as e:
    jData, jdMsg = [], f'failed to get data - {e} \nfrom'
print(api_req.status_code, api_req.reason, jdMsg, api_req.url)

# flatten each row: the "value" fields, plus the "options" fields with a suffix
tableData = [dict(
    [(k, v) for k, v in row['value'].items()] +
    [(f'{k}_options', v) for k, v in row['options'].items()]
) for row in jData]
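For illustration, here is how that flattening behaves on a hypothetical row shaped like the API's response (the field names and values below are made up):

```python
# a made-up row in the shape the comprehension expects:
# each row has a "value" dict and an "options" dict
row = {
    'value': {'number': '1', 'name': 'Edmund Hillary'},
    'options': {'name': {'color': 'red'}},
}

# merge the "value" items with suffixed "options" items into one flat dict
flat = dict(
    [(k, v) for k, v in row['value'].items()] +
    [(f'{k}_options', v) for k, v in row['options'].items()]
)
print(flat)
# {'number': '1', 'name': 'Edmund Hillary', 'name_options': {'color': 'red'}}
```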
At this point tableData is a list of dictionaries, but you can build a DataFrame from it with pandas and save it to a CSV file with .to_csv.
import pandas

pandas.DataFrame(tableData).set_index('number').to_csv('list_of_mount_everest_climbers.csv')
The API URL can be either copied from the browser network logs or extracted from the script
tag containing it in the source HTML of the page.
The shorter way would be to just split the HTML string:
import cloudscraper

pg_url = 'https://haexpeditions.com/advice/list-of-mount-everest-climbers/'
pg_req = cloudscraper.create_scraper().get(pg_url)
api_url = pg_req.text.split('"data_request_url":"', 1)[-1].split('"')[0]
api_url = api_url.replace('\\', '')  # drop the JSON escape backslashes in \/
print(pg_req.status_code, pg_req.reason, 'from', pg_req.url, '\napi_url:', api_url)
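To see what the split is doing, here it is applied to a hypothetical fragment of the page's HTML (the JSON encoding escapes the slashes as \/, which is why the replace('\\', '') is needed):

```python
# hypothetical fragment of the inline config in the page HTML;
# in the raw text, slashes in the URL are JSON-escaped as \/
html = '... "data_request_url":"https:\\/\\/haexpeditions.com\\/wp-admin\\/admin-ajax.php?action=x" ...'

# take everything after the key, then cut at the closing quote
api_url = html.split('"data_request_url":"', 1)[-1].split('"')[0]
api_url = api_url.replace('\\', '')  # drop the JSON escape backslashes
print(api_url)
# https://haexpeditions.com/wp-admin/admin-ajax.php?action=x
```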
However, it's a little risky in case "data_request_url":" appears in any other context in the HTML aside from the one that we want. So, another way would be to parse with bs4 and json.
import json
import cloudscraper
from bs4 import BeautifulSoup

pg_url = 'https://haexpeditions.com/advice/list-of-mount-everest-climbers/'
sel = 'div.footer.footer-inverse>div.bottom-bar+script[type="text/javascript"]'
api_url = 'https://haexpeditions.com/wp-admin/admin-ajax.php...' ## will be updated
pg_req = cloudscraper.create_scraper().get(pg_url)
jScript = BeautifulSoup(pg_req.content, 'html.parser').select_one(sel)
try:
    # the script assigns a JSON object, so take everything after the first '='
    sjData = json.loads(jScript.get_text().split('=', 1)[-1].strip())
    api_url = sjData['init_config']['data_request_url']
    auMsg = f'api_url: {api_url}'
except Exception as e:
    auMsg = f'failed to extract API URL - {type(e)} {e}'
print(pg_req.status_code, pg_req.reason, 'from', pg_req.url, '\n' + auMsg)
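As a sketch of that parsing step, here is the same split-and-json.loads logic applied to a hypothetical script body (the variable name and the exact contents are assumptions; only the init_config nesting mirrors what the code above expects):

```python
import json

# hypothetical body of the matched <script> tag
script_text = 'var ninjaTableConfig = {"init_config": {"data_request_url": "https://example.com/wp-admin/admin-ajax.php?action=x"}}'

# everything after the first '=' is the JSON object
sjData = json.loads(script_text.split('=', 1)[-1].strip())
print(sjData['init_config']['data_request_url'])
# https://example.com/wp-admin/admin-ajax.php?action=x
```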
(I would consider the second method more reliable even though it's a bit longer.)