Creating Pandas Dataframe from WebScraping Results

Question

I am trying to scrape a table from ESPN and load the data into a pandas DataFrame so that I can export it to Excel. I have completed most of the scraping, but I am stuck on how to send each 'td' tag to its own DataFrame cell within my for loop (code below). Any thoughts? Thanks!

import requests
import urllib.request
from bs4 import BeautifulSoup
import re
import os
import csv
import pandas as pd

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup("http://www.espn.com/nba/statistics/player/_/stat/scoring-per-game/sort/avgPoints/qualified/false")

regex = re.compile("^[e-o]")

for record in soup.findAll('tr', {"class":regex}):
    for data in record.findAll('td'):
        print(data)

What? The regex is there to remove the multiple headers that appear every n rows. — johankent30, Oct 05 '17 at 20:27
Where is the removal? You are applying regex on BeautifulSoup's parsing function, `findAll()`. Hence the above link. — Parfait, Oct 05 '17 at 20:35

score 0 · Accepted Answer · answered Oct 05 '17 at 21:53

I was recently scraping sports websites while working on a daily fantasy sports algorithm for a class, and this is the script I wrote. Perhaps the same approach can work for you: build a dictionary with one entry per column, then convert it to a DataFrame.

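    import requests
    import pandas as pd
    from bs4 import BeautifulSoup

    def strip_non_ascii(text):
        # Helper not shown in the original script; assumed to drop non-ASCII characters.
        return text.encode("ascii", "ignore").decode("ascii")

    # {0} is the season year and {1} the stat mode. The values passed to
    # format() are placeholders (assumptions); substitute the ones you need.
    url_template = "http://www.footballdb.com/stats/stats.html?lg=NFL&yr={0}&type=reg&mode={1}&limit=all"
    url = url_template.format(2016, "P")

    result = requests.get(url)
    c = result.content

    # Set as Beautiful Soup object, naming the parser explicitly
    soup = BeautifulSoup(c, "html.parser")

    # Go to the section of interest
    tables = soup.find("table", {'class': 'statistics'})

    # One inner dict per column, keyed by column position
    data = {}
    headers = {}
    for i, header in enumerate(tables.findAll('th')):
        data[i] = {}
        headers[i] = str(header.get_text())

    # Fill data[column][row] with the text of each cell
    table = tables.find('tbody')
    for r, row in enumerate(table.select('tr')):
        for i, cell in enumerate(row.select('td')):
            try:
                data[i][r] = str(cell.get_text())
            except UnicodeEncodeError:
                # Only reachable on Python 2, where str() can fail on non-ASCII text
                stat = strip_non_ascii(cell.get_text())
                data[i][r] = stat

    # Overwrite the first column with the player names from the links
    for i, name in enumerate(tables.select('tbody .left .hidden-xs a')):
        data[0][i] = str(name.get_text())

    df = pd.DataFrame(data=data)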
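
Applied to the loop in the question, the same idea might look like the sketch below. It collects the text of every 'td' in a row into a list, so each cell ends up in its own DataFrame cell, and it keeps the question's regex on the row class to skip the repeated header rows. The inline HTML, the class names (`colhead`, `oddrow`, `evenrow`), the column names and the function name `table_to_dataframe` are assumptions for illustration, not taken from the real ESPN page.

```python
import re

import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical stand-in for the scraped page: header rows repeat between
# data rows, and only the data rows have a class starting with a letter e-o.
SAMPLE_HTML = """
<table>
  <tr class="colhead"><td>RK</td><td>PLAYER</td><td>PTS</td></tr>
  <tr class="oddrow"><td>1</td><td>Player A</td><td>30.1</td></tr>
  <tr class="evenrow"><td>2</td><td>Player B</td><td>28.4</td></tr>
  <tr class="colhead"><td>RK</td><td>PLAYER</td><td>PTS</td></tr>
  <tr class="oddrow"><td>3</td><td>Player C</td><td>27.9</td></tr>
</table>
"""


def table_to_dataframe(html, row_class_pattern="^[e-o]"):
    """Return a DataFrame with one row per matching <tr> and one column per <td>."""
    soup = BeautifulSoup(html, "html.parser")
    regex = re.compile(row_class_pattern)
    rows = []
    for record in soup.find_all("tr", {"class": regex}):
        # One list per table row; each <td> becomes one cell in that row
        rows.append([cell.get_text(strip=True) for cell in record.find_all("td")])
    return pd.DataFrame(rows)


df = table_to_dataframe(SAMPLE_HTML)
df.columns = ["RK", "PLAYER", "PTS"]  # assumed column names
print(df)
```

From there, `df.to_excel("stats.xlsx", index=False)` should write the spreadsheet, provided an Excel writer such as openpyxl is installed; the file name is only an example.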
