I have a function that (1) scrapes data from a list of URLs, each of which contains table data. It parses the HTML with BeautifulSoup to collect two separate lists: the column headers and the table rows. Then it (2) iterates through the table rows to build a list of lists. Finally, (3) I call the function in a for loop that iterates through the list of URLs.
The problem I'm having is that I can't figure out how to get my column headers into the data so that they appear in the final dataframe. Should I append/insert the column headers into the output list within the function? Or is there a way to insert them into the dataframe? (I can't insert the column headers into the dataframe after the function returns, because the column_headers variable is local to the function and so isn't available globally.)
Here's basically what I have so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

my_list_of_urls = [a, list, of, several, urls]

def scraper_from_URL_list(url_parameter):
    # get the html
    html = urlopen(url_parameter)
    # create the BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")
    # collect the column headers and the table rows
    column_headers = [CSS SELECTOR GADGET TO GET COLUMN HEADER DATA]
    table_rows = soup.select(CSS SELECTOR GADGET TO GET TABLE ROW DATA)
    # build a list of lists, one inner list per table row
    output_list = []
    for row in table_rows:
        table_data_output = [COMMAND TO CONVERT TABLE ROW VARIABLE INTO AN ORGANIZED LIST OF LISTS]
        output_list.append(table_data_output)
    return output_list
# Call the function for each URL and collect the results in one dataframe
df_output = pd.DataFrame()
for url in my_list_of_urls:
    df_output = pd.concat([df_output, pd.DataFrame(scraper_from_URL_list(url))])
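For what it's worth, one direction I've been considering is to build the dataframe inside the function, where column_headers is still in scope, and return that instead of the raw list. Below is a minimal sketch of the idea, not my real scraper: the sample HTML, the "tr th" / "td" selectors, and the names SAMPLE_PAGES and scrape_table are all made up for illustration, and it uses the built-in "html.parser" on inline strings instead of urlopen and "lxml" so it runs without network access. It also assumes every page has the same headers.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in pages (assumption: the real pages are fetched with urlopen).
SAMPLE_PAGES = {
    "page_a": "<table><tr><th>Player</th><th>Points</th></tr>"
              "<tr><td>Ann</td><td>12</td></tr>"
              "<tr><td>Bob</td><td>7</td></tr></table>",
    "page_b": "<table><tr><th>Player</th><th>Points</th></tr>"
              "<tr><td>Cy</td><td>20</td></tr></table>",
}

def scrape_table(html):
    """Parse one page and return a DataFrame whose columns are the scraped headers."""
    soup = BeautifulSoup(html, "html.parser")
    # Illustrative selectors; the real ones depend on the page layout.
    column_headers = [th.get_text(strip=True) for th in soup.select("tr th")]
    output_list = []
    for row in soup.select("tr"):
        cells = [td.get_text(strip=True) for td in row.select("td")]
        if cells:  # the header row has no <td> cells, so it is skipped here
            output_list.append(cells)
    # column_headers is local, so attach it to the data before returning.
    return pd.DataFrame(output_list, columns=column_headers)

# Collect one frame per page, then concatenate once at the end.
frames = [scrape_table(html) for html in SAMPLE_PAGES.values()]
df_output = pd.concat(frames, ignore_index=True)
print(df_output)
```

Collecting the frames in a list and calling pd.concat once also avoids re-copying the growing dataframe on every loop iteration. I'm not sure whether this is preferable to returning the headers and rows as a tuple and building the dataframe outside the function.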