
I have a function that (1) scrapes data from a list of URLs, each of which contains table data. It parses the HTML text with BeautifulSoup to collect separate lists of column headers and table rows. It then (2) iterates through the table row list to create a list of lists. Finally, (3) I call the function inside a for loop that iterates through the list of URLs.

The problem I'm having is that I can't figure out how to insert my column headers into my data so that they appear in the final dataframe. Should I append/insert the column headers into the output list within the function? Or is there a way to insert them into the dataframe? (I can't insert the column headers into the dataframe after the function runs because the column_headers variable is local to the function, so it isn't available as a global variable.)

Here's basically what I have so far:

my_list_of_urls = [a, list, of, several, urls]

def scrape_sports_stats(url_parameter):

    # get the html
    html = urlopen(url_parameter)

    # create the BeautifulSoup object
    soup = BeautifulSoup(html, "lxml")

    column_headers = [CSS SELECTOR GADGET TO GET COLUMN HEADER DATA]

    table_rows = soup.select(CSS SELECTOR GADGET TO GET TABLE ROW DATA)

    output_list = []

    for row in table_rows:
        table_data_output = [COMMAND TO CONVERT TABLE ROW VARIABLE INTO AN ORGANIZED LIST OF LISTS]
        output_list.append(table_data_output)

    return output_list





# To call the function, iterate through the list of URLs, and collect the output into a dataframe

df_output = pd.DataFrame()
for url in my_list_of_urls:
    df_output = pd.concat([df_output, pd.DataFrame(scrape_sports_stats(url))])
TJE
  • You can use ```pd.read_html()``` instead of creating the list of lists - [documentation link](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html) and to add columns to a dataframe use ```columns``` attribute of the ```DataFrame``` object - https://stackoverflow.com/q/11346283/2650427 – TrigonaMinima Aug 25 '17 at 00:12
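To illustrate the comment's second suggestion, the `columns` attribute of a `DataFrame` can be assigned after construction. This is a minimal sketch; the sample rows and the `column_headers` list stand in for the scraped values:

```python
import pandas as pd

# Rows as a list of lists, as produced by the scraping loop
output_list = [[1, 2], [3, 4]]

# Hypothetical headers standing in for the scraped column_headers
column_headers = ["points", "assists"]

df = pd.DataFrame(output_list)
df.columns = column_headers  # replaces the default 0, 1, ... column labels
```

This works after the fact, but it requires `column_headers` to be visible at the point of assignment, which is exactly the scoping problem described above.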

1 Answer


It seems it will be easiest to change the return statement in your `scrape_sports_stats` function to this:

return pd.DataFrame(output_list, columns=column_headers)

You can then use a list comprehension inside pd.concat to build your concatenated DataFrame:

df_output = pd.concat([scrape_sports_stats(url) for url in my_list_of_urls])
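Putting the two pieces together, here is a runnable sketch with the network/scraping step stubbed out. The fake rows, headers, and URLs are placeholders for the real scraped values, not the actual site data:

```python
import pandas as pd

def scrape_sports_stats(url):
    # Stand-in for the BeautifulSoup scraping; real code would fetch and parse `url`
    column_headers = ["player", "score"]
    output_list = [["A", 10], ["B", 20]]
    # Returning a DataFrame keeps column_headers inside the function's scope
    return pd.DataFrame(output_list, columns=column_headers)

my_list_of_urls = ["http://example.com/1", "http://example.com/2"]

# ignore_index=True gives the concatenated frame a fresh 0..n-1 index
# instead of repeating each table's row labels
df_output = pd.concat(
    [scrape_sports_stats(url) for url in my_list_of_urls],
    ignore_index=True,
)
```

Because each call now returns a labeled `DataFrame`, the headers travel with the data and the scoping problem from the question disappears.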
cmaher