Performance of looping with pandas

Question

I originally asked a question in Data Science community:

I have a table formatted like the following table:
Feature amount    ID  
Feat1    2        1   
Feat2    0        1   
Feat3    0        1   
Feat4    1        1   
Feat2    2        2   
Feat4    0        2   
Feat3    0        2   
Feat6    1        2 
Let's say I have 200 different IDs. I want to convert all different features into variables and amount into observations, so I combine rows with same ID into one row. For example,
Feat1 Feat2 Feat3 Feat4 Feat5 Feat6 ID 
  2     0     0     1    NA    NA   1    
 NA     2     0     0    NA    1    2    
Is there a good way to do it either in Python (pandas) or R?

And this is the answer I got:

newdata = pd.DataFrame(columns=['ID', 'Location', 'Feat1', 'Feat2', 'Feat3', 'Feat4', 'Feat5', 'Feat6'])
grouped = data.groupby(['ID', 'Location'])

for index, (group_name, d) in enumerate(grouped):
    newdata.loc[index, 'ID'] = group_name[0]
    newdata.loc[index, 'Location'] = group_name[1]
    for feature, amount in zip(d['Feature'], d['amount']):
        newdata.loc[index, feature] = amount

After more Google search, I found the answer of this question says:

So try to avoid the Python loop for i, row in enumerate(...) entirely

I am wondering, regarding to my original question, is there a more efficient way?

Alexander · Accepted Answer · 2016-03-04T02:11:44.350

I believe this is what you're after.

>>> df.pivot_table(values='amount', index='ID', columns='Feature')
Feature  Feat1  Feat2  Feat3  Feat4  Feat6
ID                                        
1            2      0      0      1    NaN
2          NaN      2      0      0      1

Depending on your data and needs, there are variations. For example:

>>> df.pivot_table(values='amount', index='ID', columns='Feature', 
                   aggfunc=np.sum, fill_value=0)
Feature  Feat1  Feat2  Feat3  Feat4  Feat6
ID                                        
1            2      0      0      1      0
2            0      2      0      0      1

Thanks for your reply. I deleted the location column, it is not a factor to this question. — Adam Freeland, Mar 04 '16 at 02:10

Performance of looping with pandas

1 Answers1