Massaging data issues for an AI learning project using python which I'm also learning

Question

I'm having some issues with cleaning up data to be used in a prediction project, March madness picks. Sample raw data;

Winner Stats
,,Overall,Overall,Overall,Overall,Overall,Overall,Conf.,Conf.,Home,Home,Away,Away,Points,Points,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced,School Advanced
Rk,School,Games,Wins,Loss,W-L%,SRS,SOS,Wins,Loss,Wins,Loss,Wins,Loss,Team,Opp.,Pace,ORtg,FTr,3PAr,TS%,TRB%,AST%,STL%,BLK%,eFG%,TOV%,ORB%,FT/FGA
1,Abilene Christian NCAA,29,24,5,0.828,6.27,-6.37,13,2,13,0,5,4,2196,1753,72.0,105.2,0.339,0.367,0.540,51.5,65.3,13.5,8.9,0.512,16.2,31.3,0.230
2,Air Force,25,5,20,0.200,-12.98,0.22,3,17,3,8,1,10,1468,1798,63.9,91.9,0.309,0.450,0.547,40.1,59.0,11.1,8.0,0.521,21.2,15.3,0.214
3,Akron,23,15,8,0.652,1.85,-1.96,12,6,9,1,5,5,1798,1660,70.1,110.2,0.303,0.461,0.555,51.7,48.6,6.9,8.9,0.520,14.3,31.0,0.230

The first issues was the data had several headers lines, so used this code to skip those lines;

**WinnerStats = pd.read_csv('Winners_Stats.csv', skiprows=2)**

This worked and gave;

      Rk                  School  Games  Wins  Loss   W-L%    SRS    SOS  Wins.1  Loss.1  ...   3PAr    TS%  TRB%  AST%  STL%  BLK%   eFG%  TOV%  ORB%  FT/FGA
0      1  Abilene Christian NCAA     29    24     5  0.828   6.27  -6.37      13       2  ...  0.367  0.540  51.5  65.3  13.5   8.9  0.512  16.2  31.3   0.230
1      2               Air Force     25     5    20  0.200 -12.98   0.22       3      17  ...  0.450  0.547  40.1  59.0  11.1   8.0  0.521  21.2  15.3   0.214
2      3                   Akron     23    15     8  0.652   1.85  -1.96      12       6  ...  0.461  0.555  51.7  48.6   6.9   8.9  0.520  14.3  31.0   0.230

[356 rows x 29 columns]

Now, reading the data information, if the team name has NCAA append at the end, they went to the tournament. So I wrote this code after much research to filter all the rows that had NCAA in the name.

**mask1 = WinnerStats['School'].str.contains('NCAA', case=False, na=False)
ws1 = WinnerStats[mask1]**

This worked.

      Rk             School  Games  Wins  Loss   W-L%    SRS    SOS  Wins.1  Loss.1  ...   3PAr    TS%  TRB%  AST%  STL%  BLK%   eFG%  TOV%  ORB%  FT/FGA
0      1  Abilene Christian     29    24     5  0.828   6.27  -6.37      13       2  ...  0.367  0.540  51.5  65.3  13.5   8.9  0.512  16.2  31.3   0.230
6      7            Alabama     33    26     7  0.788  19.58  10.01      16       2  ...  0.465  0.544  51.7  50.2  11.3  10.6  0.517  15.9  31.8   0.204
10    11  Appalachian State     29    17    12  0.586  -5.84  -5.72       7       8  ...  0.437  0.520  50.8  49.9  11.6  10.4  0.478  14.9  29.8   0.252

[68 rows x 29 columns]

So, now I'm at the point where I want to start using the data to predict some outcomes (this is old data so I already know what the program should be able to predict). I need to match the two teams up T1 vs T2. I have another dataset that has the match ups, but it's by team ID. I have a third datasets that has team name and teamID.

**TeamID = pd.read_csv('MTeams.csv')
mask1 = WinnerStats['School'].str.contains('NCAA', case=False, na=False)
ws1 = WinnerStats[mask1] # just the NCAA teams
ws1['School'] = WinnerStats['School'].str.slice(start=0, stop=-5) #Remove NCAA from the school name**

This code gets all rows that has NCAA in the name via mask into ws1. But the last line to remove the NCAA works, but gives this warning;

mm2021_Data_Cleaner.py:46: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  ws1['School'] = WinnerStats['School'].str.slice(start=0, stop=-5)

I don't know enough about python to say, I can ignore the warning, or make some kind change. I followed the link, and none of the explanation seem to make any sense to me and what I'm trying to do.

The second problem is trying to code up a way to compare the names from the TeamID to the WinningStats. To match up TeamID to WinnerStat I needs to match up on "similar school name"; Example. "Abilene Christian" (from WinnerStat) to "Abilene Chr" (TeamID). So I this code

**ws1['TeamID'] = TeamID['TeamID'].where(WinnerStats['School'].str.contains(TeamID['School']))**

gives this error;

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3621, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 136, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 163, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'School'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/March_Madness/mm2021_Data_Cleaner.py", line 46, in <module>
    ws1['TeamID'] = TeamID['TeamID'].where(WinnerStats['School'].str.contains(TeamID['School']))
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/frame.py", line 3505, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3623, in get_loc
    raise KeyError(key) from err
KeyError: 'School'
dennis@MBP2021 March_Madness %

I have tried several different permutations, other code that I'm pretty lost and the "think" this is the last code I tried. All with similar strange error messages.

To address the SettingWithCopyWarning, change `ws1 = WinnerStats[mask1]` to `ws1 = WinnerStats[mask1].copy()`; see https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas. Could you edit your question to include the code that produced the error? — Peter Leimbigler, May 02 '22 at 02:01
Sorry, I thought I put that code in there. Here is the code, and I have updated the Q. ws1['TeamID'] = TeamID['TeamID'].where(WinnerStats['School'].str.contains(TeamID['School'])) — BigD, May 02 '22 at 03:03
Thanks Peter, that cleared the warning. I'm not sure I understand all in that link. Unfamiliarity with how these data structures are described. Thanks again. — BigD, May 02 '22 at 03:19
Please trim your code to make it easier to find your problem. Follow these guidelines to create a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example). — Community, May 02 '22 at 14:51
A bot said to make the code more visible; here is the code: **TeamID = pd.read_csv('MTeams.csv') # get just the NCAA teams mask1 = WinnerStats['School'].str.contains('NCAA', case=False, na=False) #apply the mask ws1 = WinnerStats[mask1] #Remove NCAA from the school name** ws1['School'] = WinnerStats['School'].str.slice(start=0, stop=-5) #link up their file to the winners file on a fuzzy match between school #names ws1['TeamID'] = TeamID['TeamID'].where(WinnerStats['School'].str.contains(TeamID['School'])) see above for error message — BigD, May 03 '22 at 00:47
@BigD - not quite sure what you're referring to, but code needs to be properly formatted, and placed in your question, not as a comment. Also, I spent time formatting the code in your question for readability - what other issue(s) are you trying to solve? — David Makogon, May 03 '22 at 03:14
What I'm trying to do is take 1 column from a file, and append it to another file. File 1 has all the teams stats, by date and school name. File 2 has the match up for every game. However it uses a teamid only. File 3 has the school name and school ID. However the school name is not a 1 to 1 match, school name has some abbreviations. I want to take F3 school id and appended it to F1 matched on a fuzzy school name. I also need to get what school, a school played. Info in F2 appended to F1 — BigD, May 04 '22 at 19:38
By the way, how did you format that org Q, I tried following a link that explains it. But it didn't come out that way. — BigD, May 04 '22 at 19:39

Massaging data issues for an AI learning project using python which I'm also learning

0 Answers0