Another pandas question.
Reading Wes Mckinney's excellent book about Data Analysis and Pandas, I encountered the following thing that I thought should work:
Suppose I have some info about tips.
In [119]:
tips.head()
Out[119]:
total_bill tip sex smoker day time size tip_pct
0 16.99 1.01 Female False Sun Dinner 2 0.059447
1 10.34 1.66 Male False Sun Dinner 3 0.160542
2 21.01 3.50 Male False Sun Dinner 3 0.166587
3 23.68 3.31 Male False Sun Dinner 2 0.139780
4 24.59 3.61 Female False Sun Dinner 4 0.146808
and I want to know the five largest tips in relation to the total bill, that is, tip_pct
for smokers and non-smokers separately. So this works:
def top(df, n=5, column='tip_pct'):
return df.sort_index(by=column)[-n:]
In [101]:
tips.groupby('smoker').apply(top)
Out[101]:
total_bill tip sex smoker day time size tip_pct
smoker
False 88 24.71 5.85 Male False Thur Lunch 2 0.236746
185 20.69 5.00 Male False Sun Dinner 5 0.241663
51 10.29 2.60 Female False Sun Dinner 2 0.252672
149 7.51 2.00 Male False Thur Lunch 2 0.266312
232 11.61 3.39 Male False Sat Dinner 2 0.291990
True 109 14.31 4.00 Female True Sat Dinner 2 0.279525
183 23.17 6.50 Male True Sun Dinner 4 0.280535
67 3.07 1.00 Female True Sat Dinner 1 0.325733
178 9.60 4.00 Female True Sun Dinner 2 0.416667
172 7.25 5.15 Male True Sun Dinner 2 0.710345
Good enough, but then I wanted to use pandas' transform to do the same like this:
def top_all(df):
return df.sort_index(by='tip_pct')
tips.groupby('smoker').transform(top_all)
but instead I get this:
TypeError: Transform function invalid for data types
Why? I know that transform requires to return an array of the same dimensions that it accepts as input, so I thought I'd be complying with that requirement just sorting both slices (smokers and non-smokers) of the original DataFrame without changing their respective dimensions. Can anyone explain why it failed?