I am trying to scrape data from Fangraphs. The tables are split across 21 pages, but all of the pages use the same URL. I am very new to web scraping (and to Python in general), but Fangraphs does not have a public API, so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML, and I am able to scrape the initial table, but that only contains the first 30 players and I want the entire player pool. After two days of web searching I am stuck. The link and my current code are below. I know the site has a link to download the CSV file, but that gets tedious throughout the season, and I would like to expedite the data-harvesting process. Any direction would be helpful, thank you.

https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc

import requests
from bs4 import BeautifulSoup
import pandas as pd

url = 'https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=fangraphsdc&team=0&lg=all&players=0'

response = requests.get(url, verify=False)

# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')

# Locate the projection grid (the table id comes from the page markup)
stat_table = soup.find_all('table', id='ProjectionBoard1_dg1_ctl00')

# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]

# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)

# convert table to df
table = pd.DataFrame(rows)

1 Answer

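The page is an ASP.NET WebForms app, so paging is driven by a POST-back rather than by the URL. The code below GETs the page once to harvest the __VIEWSTATE and __EVENTVALIDATION tokens, then POSTs them back with an __EVENTARGUMENT that asks the grid for a page size of 1000, which returns the whole player pool in a single response that pandas can read directly.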
import requests
from bs4 import BeautifulSoup
import pandas as pd

params = {
    "pos": "all",
    "stats": "bat",
    "type": "fangraphsdc"
}

data = {
    'RadScriptManager1_TSM': 'ProjectionBoard1$dg1',
    # fire the grid's PageSize command so the server returns
    # 1000 rows at once instead of the default 30
    "__EVENTTARGET": "ProjectionBoard1$dg1",
    '__EVENTARGUMENT': 'FireCommand:ProjectionBoard1$dg1$ctl00;PageSize;1000',
    '__VIEWSTATEGENERATOR': 'C239D6F0',
    '__SCROLLPOSITIONX': '0',
    '__SCROLLPOSITIONY': '1366',
    "ProjectionBoard1_tsStats_ClientState": "{\"selectedIndexes\":[\"0\"],\"logEntries\":[],\"scrollState\":{}}",
    "ProjectionBoard1_tsPosition_ClientState": "{\"selectedIndexes\":[\"0\"],\"logEntries\":[],\"scrollState\":{}}",
    "ProjectionBoard1$rcbTeam": "All+Teams",
    "ProjectionBoard1_rcbTeam_ClientState": "",
    "ProjectionBoard1$rcbLeague": "All",
    "ProjectionBoard1_rcbLeague_ClientState": "",
    "ProjectionBoard1_tsProj_ClientState": "{\"selectedIndexes\":[\"5\"],\"logEntries\":[],\"scrollState\":{}}",
    "ProjectionBoard1_tsUpdate_ClientState": "{\"selectedIndexes\":[],\"logEntries\":[],\"scrollState\":{}}",
    "ProjectionBoard1$dg1$ctl00$ctl02$ctl00$PageSizeComboBox": "30",
    "ProjectionBoard1_dg1_ctl00_ctl02_ctl00_PageSizeComboBox_ClientState": "",
    # the grid's page-size dropdown, set to 1000 to match the command above
    "ProjectionBoard1$dg1$ctl00$ctl03$ctl01$PageSizeComboBox": "1000",
    "ProjectionBoard1_dg1_ctl00_ctl03_ctl01_PageSizeComboBox_ClientState": "{\"logEntries\":[],\"value\":\"1000\",\"text\":\"1000\",\"enabled\":true,\"checkedIndices\":[],\"checkedItemsTextOverflows\":false}",
    "ProjectionBoard1_dg1_ClientState": ""
}


def main(url):
    with requests.Session() as req:
        # initial GET to pick up the per-session ASP.NET state tokens
        r = req.get(url, params=params)
        soup = BeautifulSoup(r.content, 'html.parser')
        data['__VIEWSTATE'] = soup.find("input", id="__VIEWSTATE").get("value")
        data['__EVENTVALIDATION'] = soup.find(
            "input", id="__EVENTVALIDATION").get("value")
        # POST the tokens back together with the page-size command
        r = req.post(url, params=params, data=data)
        # read the grid straight into a DataFrame by its table id
        df = pd.read_html(r.content, attrs={
                          'id': 'ProjectionBoard1_dg1_ctl00'})[0]
        # drop the grid's second column
        df.drop(df.columns[1], axis=1, inplace=True)
        print(df)
        df.to_csv("data.csv", index=False)


main("https://www.fangraphs.com/projections.aspx")
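To adapt this to other pages on the site, the same pattern should carry over, assuming those pages are also ASP.NET WebForms like this one: GET the page, harvest the hidden state fields, then POST them back with the command you want. Here is a minimal sketch of a reusable helper for the harvesting step; the name get_hidden_fields is mine, not something from Fangraphs or the answer above:

def get_hidden_fields(soup):
    # Collect every hidden ASP.NET state input (__VIEWSTATE,
    # __EVENTVALIDATION, __VIEWSTATEGENERATOR, ...) so a POST-back
    # can echo them to the server exactly as it expects.
    return {
        tag["name"]: tag.get("value", "")
        for tag in soup.find_all("input", type="hidden")
        if tag.get("name", "").startswith("__")
    }

Merging the result into the payload before the POST, e.g. data.update(get_hidden_fields(soup)), would replace the two hard-coded lookups above and keep working if the server adds more state fields.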

Output: the full projection table, printed and written to data.csv.

  • Thank you. The first time I tried it, I still only got the first 30 players, but the next time I tried it, I got everything; I am sure it was on my end. Now I just have to figure out what this code is doing so I can pull data from other pages on this site. Again, thank you for your help. – dpeters555 Apr 17 '20 at 19:53
  • @dpeters555 You copied it while I was editing; now it should work 100%. – αԋɱҽԃ αмєяιcαη Apr 17 '20 at 19:56