I am learning Python and trying to write a function that web scrapes tables of vaccination rates from several different web pages of a GitHub repository for Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data; see also https://ourworldindata.org/about). The code works perfectly when scraping a single table and saving it into a data frame:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/country_data/Bangladesh.csv"
response = requests.get(url)
response  # quick check that the request succeeded (<Response [200]>)

# Parse the page and pull out GitHub's rendered CSV table by its CSS classes
scraping_html_table_BD = BeautifulSoup(response.content, "lxml")
scraping_html_table_BD = scraping_html_table_BD.find_all("table", "js-csv-data csv-data js-file-line-container")

# read_html returns a list of DataFrames; the first entry is the table we want
df = pd.read_html(str(scraping_html_table_BD))
BD_df = df[0]
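As an aside, since each page is just GitHub's rendering of a CSV file, I believe the same data can be loaded straight from the raw file URL without any HTML parsing (a minimal sketch, assuming the usual raw.githubusercontent.com layout for this repository):

import pandas as pd

# Assumed mapping: github.com/<owner>/<repo>/blob/<branch>/<path>
# becomes raw.githubusercontent.com/<owner>/<repo>/<branch>/<path>
raw_url = ("https://raw.githubusercontent.com/owid/covid-19-data/"
           "master/public/data/vaccinations/country_data/Bangladesh.csv")
BD_df_raw = pd.read_csv(raw_url)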
But I have not had much luck trying to create a function that scrapes several pages. I have been following a tutorial (the section 'Scrape multiple pages with one script') and similar StackOverflow questions, amongst other pages. I tried creating a global variable first, but I ended up with errors like "RecursionError: maximum recursion depth exceeded while calling a Python object". The code below is the best I have managed, in that it doesn't raise an error, but I have not managed to save the output to a global variable. I really appreciate your help.
import pandas as pd
from bs4 import BeautifulSoup
import requests

link_list = ['/Bangladesh.csv',
             '/Nepal.csv',
             '/Mongolia.csv']

def get_info(page_url):
    page = requests.get('https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data' + page_url)
    scape = BeautifulSoup(page.text, 'html.parser')
    # Same table lookup as in the single-page version
    vaccination_rates = scape.find_all("table", "js-csv-data csv-data js-file-line-container")
    result = {}
    df = pd.read_html(str(vaccination_rates))
    vaccination_rates = df[0]
    df = pd.DataFrame(vaccination_rates)
    print(df)
    df.to_csv("testdata.csv", index=False)

for link in link_list:
    get_info(link)
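What I think I am after is something like the sketch below: have get_info() return its DataFrame instead of writing a CSV inside the function, then combine the per-country frames afterwards (a rough sketch of the goal, not working code; combined_df is just an illustrative name):

# Assumes get_info() is changed to end with `return df` instead of df.to_csv(...)
frames = [get_info(link) for link in link_list]
combined_df = pd.concat(frames, ignore_index=True)  # stack all countries into one frame
combined_df.to_csv("testdata.csv", index=False)     # write once, after the loop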
Edit: when I read the CSV back in, I can only see the data for the final link that is iterated, not the data from the preceding links. I suspect this is because df.to_csv("testdata.csv", index=False) runs inside the function and overwrites the file on every iteration, so only the last page's data survives.
new = pd.read_csv('testdata.csv')
pd.set_option("display.max_rows", None, "display.max_columns", None)  # show every row and column
new
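If overwriting inside the function is the problem, one variant I can imagine (a sketch, untested; the header handling via os.path.exists is my own assumption) is appending to the file on each call instead of rewriting it:

import os

def get_info(page_url):
    page = requests.get('https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data' + page_url)
    scape = BeautifulSoup(page.text, 'html.parser')
    vaccination_rates = scape.find_all("table", "js-csv-data csv-data js-file-line-container")
    df = pd.read_html(str(vaccination_rates))[0]
    # mode='a' appends each country's rows; write the header only if the file is new
    df.to_csv("testdata.csv", mode='a', index=False,
              header=not os.path.exists("testdata.csv"))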