0

I am using a combination of beautifulsoup and pandas to try and get sports reference data by looping through boxscore pages, obtaining the dataframes for each team and concatenating them all together. I noticed that the way the table is formatted on each page, there are row dividers separating the starters from the reserves, and this row divider has the value "Reserves" in the 'Starter' column (which I later rename to 'Player_Name'), with the remaining column headers repeated for the rest of its values. When this data is input into the dataframe, the row dividers are brought in as a normal row. I would like to add a separate column that holds a Y/N value for whether or not that player started the game and remove all records where the 'Starters' column is equal to "Reserves".

I have tried adding a column but I'm struggling with a method to get the default values to be "Y" for the first x number of rows and "N" for the remaining rows.

Here is a brief example of the table followed by the code I am using. Let me know if you have any thoughts!

EDIT: I may have oversimplified this, as there are actually two header columns and it appears this is causing an issue when trying the solutions presented. How can I remove the first header column that just states 'Basic Box Score Stats' and 'Advanced Box Score Stats'?

Basic Box Score Stats            Advanced Box Score Stats
Starters              MP    FG   +/-  xyz%
Player1               20:00 17   5    12
Player2               15:00 8    4    10
Player3               10:00 9    3    8
Player4               9:00  3    2    6
Player5               8:00  1    1    4
Reserves              MP    FG   +/-  xyz%
Player4               7:00  1    1    2
Player5               4:00  1    1    2
Player6               3:30  1    1    2
import pandas as pd
from bs4 import BeautifulSoup
#performed steps in bs4 to get the links to individual boxscores
    for boxscore_link in boxscore_links:
        basketball_ref_dfs=pd.read_html(MainURL + boxscore_link)
        if len(basketball_ref_dfs) = 4:
            away_team_stats = pd.concat([basketball_ref_dfs[0],basketball_ref_dfs[1]])
            home_team_stats = pd.concat([basketball_ref_dfs[2],basketball_ref_dfs[3]])
        else:
            away_team_stats = basketball_ref_dfs[0]
            home_team_stats = basketball_ref_dfs[1]
#new code to be added here to fix 'reserve' row header for away/home_team_stats        
full_game_stats = pd.concat([away_team_stats,home_team_stats])
        full_season_stats = full_season_stats.append(full_game_stats,ignore_index=True)
    full_season_stats

#what I want:
away_team_stats['Starter']='Y' # + some condition to only set this value for the first x occurrences or set to 'Y' until row value equals Reserve, then set remaining to 'N'

2 Answers2

0

You can do this in three steps:

  1. Set the default value 'N' for the entire column using away_team_stats['Starter']='N'
  2. Set the value for the first x rows to be 'Y' using the iloc method with away_team_stats.iloc[:x, 2]='Y' (I believe the 'Starter' column will be in position 2 if appending to your example data but you may need to edit this)
  3. Remove the row with 'Player_Name' == 'Reserves' by using the loc method with away_team_stats = away_team_stats.loc[away_team_stats['Player_Name']!='Reserves', :]

The iloc method will slice your dataframe by numerical index/column and the loc method will slice your dataframe by index/column label

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html

NickHilton
  • 662
  • 6
  • 13
  • Thanks! Steps 1 and 2 make sense, and in step 3 everything up until the last comma makes sense. It is locating instances where the 'Player_Name' column is not equal to 'Reserves' and making that subset the new dataframe. What do the final comma and colon do? – Mohammed Mahfuz Jul 17 '19 at 21:10
  • Thats right, when you do call `df.loc[x,y]` the `x` condition tells the `loc` which rows to get and the y condition tells the `loc` which columns to get. The final colon means return all columns in this case, and in the `iloc` case in step 2 means return all rows up until the xth row. – NickHilton Jul 17 '19 at 23:03
  • Gotcha, that makes sense! I am running into another issue, I don't know if it makes sense to get another thread but the data appears to have two header rows and I want to try and remove the first header which doesn't provide any important information. I've updated the question and table accordingly. – Mohammed Mahfuz Jul 18 '19 at 13:35
  • That is caused by your columns having a multiindex, i.e. there are two levels to classify the columns. see (https://stackoverflow.com/questions/22233488/pandas-drop-a-level-from-a-multi-level-column-index) for how to drop the level – NickHilton Jul 18 '19 at 16:50
  • Ah perfect thank you so much! Can't upvote but I appreciate the help! – Mohammed Mahfuz Jul 18 '19 at 20:38
  • Thats ok, you can accept the answer if you cant upvote yet https://stackoverflow.com/help/someone-answers – NickHilton Jul 18 '19 at 22:33
0

You can do, if you know the index already where your 'Reserve' value appears, let's say in this case it appears in the 10th record. I initially set everything to 'N', and then turn the first 10 rows to 'Y'.

away_team_stats['Starter'] = 'N'
away_team_stats.loc[:9, 'Starter'] = 'Y'


Or you can do:
idx = away_team_stats.loc[away_team_stats['Starter'] == 'Reserve'].index[0]

This gives you in which index 'Reserve' appears for the first time.

You can now do it as above:

away_team_stats.loc[:idx, 'Starter'] = 'Y'
away_team_stats.loc[idx+1:, 'Starter'] = 'N'

Sets the first few rows until the word 'Reserve' appeared for the first time to 'Y', and then sets the remaining to 'N'.

Ankur Sinha
  • 6,473
  • 7
  • 42
  • 73
  • I like this method of finding where the first instance of 'Reserve' is in case there is ever an instance where it isn't exactly X rows down, however I am running into another issue that seems to prevent me from trying this, please see edit above. – Mohammed Mahfuz Jul 17 '19 at 21:09