I am using a combination of BeautifulSoup and pandas to scrape Sports Reference data by looping through box score pages, reading the DataFrames for each team, and concatenating them all together. The table on each page has a divider row separating the starters from the reserves. This divider has the value "Reserves" in the 'Starters' column (which I later rename to 'Player_Name'), and the column headers are repeated for the rest of its values. When the table is read into a DataFrame, the divider comes in as a normal row. I would like to add a separate column that holds a Y/N value for whether or not that player started the game, and remove all rows where the 'Starters' column is equal to "Reserves".
I have tried adding a column, but I'm struggling to find a way to set the value to "Y" for the first x rows and "N" for the remaining rows.
Here is a brief example of the table followed by the code I am using. Let me know if you have any thoughts!
EDIT: I may have oversimplified this, as there are actually two header rows, and this appears to cause an issue when trying the solutions presented. How can I remove the first header row, which just states 'Basic Box Score Stats' and 'Advanced Box Score Stats'?
Basic Box Score Stats Advanced Box Score Stats
Starters  MP     FG  +/-  xyz%
Player1   20:00  17  5    12
Player2   15:00  8   4    10
Player3   10:00  9   3    8
Player4   9:00   3   2    6
Player5   8:00   1   1    4
Reserves  MP     FG  +/-  xyz%
Player4   7:00   1   1    2
Player5   4:00   1   1    2
Player6   3:30   1   1    2
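For the two header rows, here is a minimal sketch of one possible approach: if pd.read_html returns the columns as a two-level MultiIndex, dropping level 0 leaves only the inner names. The toy frame below is an assumption built from the example table; in particular, which inner columns sit under which top-level label is a guess, and team_stats is a hypothetical name.

```python
import pandas as pd

# Toy two-level header mimicking the example table. The grouping of
# inner columns under each top-level label is assumed for illustration.
columns = pd.MultiIndex.from_tuples([
    ("Basic Box Score Stats", "Starters"),
    ("Basic Box Score Stats", "MP"),
    ("Basic Box Score Stats", "FG"),
    ("Basic Box Score Stats", "+/-"),
    ("Advanced Box Score Stats", "xyz%"),
])
rows = [
    ["Player1", "20:00", "17", "5", "12"],
    ["Player2", "15:00", "8", "4", "10"],
    ["Reserves", "MP", "FG", "+/-", "xyz%"],
    ["Player6", "3:30", "1", "1", "2"],
]
team_stats = pd.DataFrame(rows, columns=columns)

# Drop the outer level so only 'Starters', 'MP', ... remain.
team_stats.columns = team_stats.columns.droplevel(0)
print(team_stats.columns.tolist())
```

After this step, team_stats['Starters'] can be selected with a plain string label instead of a tuple.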
import pandas as pd
from bs4 import BeautifulSoup

# performed steps in bs4 to get the links to individual boxscores
all_games = []  # one DataFrame per game, combined after the loop
for boxscore_link in boxscore_links:
    basketball_ref_dfs = pd.read_html(MainURL + boxscore_link)
    if len(basketball_ref_dfs) == 4:
        away_team_stats = pd.concat([basketball_ref_dfs[0], basketball_ref_dfs[1]])
        home_team_stats = pd.concat([basketball_ref_dfs[2], basketball_ref_dfs[3]])
    else:
        away_team_stats = basketball_ref_dfs[0]
        home_team_stats = basketball_ref_dfs[1]
    # new code to be added here to fix 'reserve' row header for away/home_team_stats
    full_game_stats = pd.concat([away_team_stats, home_team_stats])
    all_games.append(full_game_stats)

# DataFrame.append was removed in pandas 2.0, so combine with pd.concat
full_season_stats = pd.concat(all_games, ignore_index=True)
full_season_stats

# what I want:
away_team_stats['Starter'] = 'Y'  # + some condition to only set this value for the first x occurrences, or set to 'Y' until the row value equals "Reserves", then set the remaining rows to 'N'
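
To make the desired behavior concrete, here is a rough sketch of one way it could work: a cumulative sum over the "is this the divider row?" mask stays at 0 until the "Reserves" row and is at least 1 from there on. It assumes single-level columns, exactly one divider per frame (so it would need to run on each table before any concatenation), and flag_starters is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def flag_starters(team_stats: pd.DataFrame) -> pd.DataFrame:
    """Add a Y/N 'Starter' column and drop the 'Reserves' divider row.

    Assumes single-level columns with player names in 'Starters'
    and a single divider row per frame.
    """
    is_divider = team_stats["Starters"].eq("Reserves")
    # cumsum() is 0 above the divider and >= 1 from the divider onward.
    after_divider = is_divider.cumsum().gt(0)
    out = team_stats.copy()
    out["Starter"] = np.where(after_divider, "N", "Y")
    # Remove the divider itself, then rename the name column.
    out = out.loc[~is_divider].rename(columns={"Starters": "Player_Name"})
    return out.reset_index(drop=True)

# Toy data based on the example table above.
away_team_stats = pd.DataFrame({
    "Starters": ["Player1", "Player2", "Reserves", "Player6"],
    "MP": ["20:00", "15:00", "MP", "3:30"],
    "FG": ["17", "8", "FG", "1"],
})
flagged = flag_starters(away_team_stats)
print(flagged)
```

Because the divider row holds header text, the stat columns come in as strings, so they would still need converting (for example with pd.to_numeric) once the divider is gone.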