0

Actually am unsure if the end of this is cross-section because it's over a time period, but I think it is still.

I have a data frame that looks like this:

Player          Finish  Tournament  Year    id
------------------------------------------------
Aaron Baddeley  9       Memorial    2012    1
Aaron Baddeley  17      Masters     2013    1
Aaron Watkins   15      US Open     2012    2
Adam Scott      9       US Open     2014    3
Adam Scott      4       Memorial    2014    3
Alex Cejka      8       US Open     2010    4
Andres Romero   2       Memorial    2012    5
Andrew Svoboda  19      Memorial    2014    6
Andy Sullivan   13      Memorial    2015    7

I want to convert this data to single observations, with the desired output like this:

Player           2012_Memorial    2013_Memorial    2014_Memorial   ...  id
----------------------------------------------------------------------------
Aaron Baddeley        9                 17              2012             1
Adam Scott            NA                NA               9               3 
.
. 
.

I've found the split-apply-combine paradigm, which looks promising. But even on the surface, I've done df.groupby('id') and a print statement outputs this:

               Player  Finish Tournament  Year 
id                                                                        
1      Aaron Baddeley       9   Memorial  2012 
2       Aaron Watkins      15    US Open  2012 
3          Adam Scott       9    US Open  2014 

So it seems to have collapsed the groups, but I've now lost data? Or how is the object now stored? I realize I haven't done the apply stage, which is probably how I will generate new rows and new columns, but I don't know the next step or if there's a cookbook example for something like this.

Thanks, Jared

Ami Tavory
  • 74,578
  • 11
  • 141
  • 185
Jared
  • 3,651
  • 11
  • 39
  • 64

1 Answers1

1

It looks like you're looking for a way to do a pivot_table operation.

Starting with an approximation of your table:

df = pd.DataFrame({
    'player': ['ab', 'ab', 'aw', 'as', 'as'],
    'finish': [9, 17, 15, 9, 4],
    'tournament': ['m', 'm', 'us', 'us', 'm'],
    'year': [12, 13, 12, 12, 14],
    'id': [1, 1, 2, 3, 3]})

giving:

>> df
        finish  id  player  tournament  year
0   9   1   ab  m   12
1   17  1   ab  m   13
2   15  2   aw  us  12
3   9   3   as  us  12
4   4   3   as  m   14

You can run

>> df.pivot_table(values='finish', index=['player', 'id'], columns=['year', 'tournament'])
    year    12  13  14
    tournament  m   us  m   m
player  id              
ab  1   9   NaN     17  NaN
as  3   NaN     9   NaN     4
aw  2   NaN     15  NaN     NaN

Note that it gives you a multilevel column of years/tournaments. I personally think this is cleaner, but you can easily flatten it.

Community
  • 1
  • 1
Ami Tavory
  • 74,578
  • 11
  • 141
  • 185