I'm having some issues with cleaning up data to be used in a prediction project, March madness picks. Sample raw data;
Winner Stats
,,Overall,Overall,Overall,Overall,Overall,Overall,Conf.,Conf.,Home,Home,Away,Away,Points,Points,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced
Rk,School,Games,Wins,Loss,W-L%,SRS,SOS,Wins,Loss,Wins,Loss,Wins,Loss,Team,Opp.,Pace,ORtg,FTr,3PAr,TS%,TRB%,AST%,STL%,BLK%,eFG%,TOV%,ORB%,FT/FGA
1,Abilene Christian NCAA,29,24,5,0.828,6.27,-6.37,13,2,13,0,5,4,2196,1753,72.0,105.2,0.339,0.367,0.540,51.5,65.3,13.5,8.9,0.512,16.2,31.3,0.230
2,Air Force,25,5,20,0.200,-12.98,0.22,3,17,3,8,1,10,1468,1798,63.9,91.9,0.309,0.450,0.547,40.1,59.0,11.1,8.0,0.521,21.2,15.3,0.214
3,Akron,23,15,8,0.652,1.85,-1.96,12,6,9,1,5,5,1798,1660,70.1,110.2,0.303,0.461,0.555,51.7,48.6,6.9,8.9,0.520,14.3,31.0,0.230
The first issues was the data had several headers lines, so used this code to skip those lines;
**WinnerStats = pd.read_csv('Winners_Stats.csv', skiprows=2)**
This worked and gave;
Rk School Games Wins Loss W-L% SRS SOS Wins.1 Loss.1 ... 3PAr TS% TRB% AST% STL% BLK% eFG% TOV% ORB% FT/FGA
0 1 Abilene Christian NCAA 29 24 5 0.828 6.27 -6.37 13 2 ... 0.367 0.540 51.5 65.3 13.5 8.9 0.512 16.2 31.3 0.230
1 2 Air Force 25 5 20 0.200 -12.98 0.22 3 17 ... 0.450 0.547 40.1 59.0 11.1 8.0 0.521 21.2 15.3 0.214
2 3 Akron 23 15 8 0.652 1.85 -1.96 12 6 ... 0.461 0.555 51.7 48.6 6.9 8.9 0.520 14.3 31.0 0.230
[356 rows x 29 columns]
Now, reading the data information, if the team name has NCAA append at the end, they went to the tournament. So I wrote this code after much research to filter all the rows that had NCAA in the name.
**mask1 = WinnerStats['School'].str.contains('NCAA', case=False, na=False)
ws1 = WinnerStats[mask1]**
This worked.
Rk School Games Wins Loss W-L% SRS SOS Wins.1 Loss.1 ... 3PAr TS% TRB% AST% STL% BLK% eFG% TOV% ORB% FT/FGA
0 1 Abilene Christian 29 24 5 0.828 6.27 -6.37 13 2 ... 0.367 0.540 51.5 65.3 13.5 8.9 0.512 16.2 31.3 0.230
6 7 Alabama 33 26 7 0.788 19.58 10.01 16 2 ... 0.465 0.544 51.7 50.2 11.3 10.6 0.517 15.9 31.8 0.204
10 11 Appalachian State 29 17 12 0.586 -5.84 -5.72 7 8 ... 0.437 0.520 50.8 49.9 11.6 10.4 0.478 14.9 29.8 0.252
[68 rows x 29 columns]
So, now I'm at the point where I want to start using the data to predict some outcomes (this is old data so I already know what the program should be able to predict). I need to match the two teams up T1 vs T2. I have another dataset that has the match ups, but it's by team ID. I have a third datasets that has team name and teamID.
**TeamID = pd.read_csv('MTeams.csv')
mask1 = WinnerStats['School'].str.contains('NCAA', case=False, na=False)
ws1 = WinnerStats[mask1] # just the NCAA teams
ws1['School'] = WinnerStats['School'].str.slice(start=0, stop=-5) #Remove NCAA from the school name**
This code gets all rows that has NCAA in the name via mask into ws1. But the last line to remove the NCAA works, but gives this warning;
mm2021_Data_Cleaner.py:46: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
ws1['School'] = WinnerStats['School'].str.slice(start=0, stop=-5)
I don't know enough about python to say, I can ignore the warning, or make some kind change. I followed the link, and none of the explanation seem to make any sense to me and what I'm trying to do.
The second problem is trying to code up a way to compare the names from the TeamID to the WinningStats. To match up TeamID to WinnerStat I needs to match up on "similar school name"; Example. "Abilene Christian" (from WinnerStat) to "Abilene Chr" (TeamID). So I this code
**ws1['TeamID'] = TeamID['TeamID'].where(WinnerStats['School'].str.contains(TeamID['School']))**
gives this error;
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
return self._engine.get_loc(casted_key)
File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'School'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/March_Madness/mm2021_Data_Cleaner.py", line 46, in <module>
ws1['TeamID'] = TeamID['TeamID'].where(WinnerStats['School'].str.contains(TeamID['School']))
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 3505, in __getitem__
indexer = self.columns.get_loc(key)
File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
raise KeyError(key) from err
KeyError: 'School'
dennis@MBP2021 March_Madness %
I have tried several different permutations, other code that I'm pretty lost and the "think" this is the last code I tried. All with similar strange error messages.