0

Suppose I have the following data frame:

import pandas as pd

df = pd.DataFrame()
df['ID'] = 1, 1, 1, 2, 2, 3, 3
df['a'] = 3, 5, 6, 3, 8, 1, 2

I want to create a for loop that loops over ID and returns the sum of 'a' for that ID. So far I have this:

for i in df['ID']:
    print(i, df.loc[df['ID'] == i, 'a'].sum())

However this returns multiples of the same value like so:

1 14
1 14
1 14
2 11
2 11
3 3
3 3

How do I edit my pool so that once it has returned the value for 'id' == 1 it moves on to the next id value rather than just down to the next row?

I'm looking to get the following:

1 14
2 11
3 3

Thanks in advance!

tom91
  • 685
  • 7
  • 24

1 Answers1

1

This is much better suited to groupby rather than looping (as are many pandas dataframe problems):

>>> df.groupby('ID')['a'].sum()
ID
1    14
2    11
3     3
Name: a, dtype: int64

However, just to explain where your loop went wrong, you can just loop through the unique values of df['ID'], rather than all rows:

for i in df['ID'].unique():
    print(i, df.loc[df['ID'] == i, 'a'].sum())

1 14
2 11
3 3
sacuL
  • 49,704
  • 8
  • 81
  • 106
  • Wonderful, both of these work nicely! Could you explain why groupby is better than a loop in this instance other than its a single line – tom91 Nov 27 '18 at 15:48
  • 1
    for starters, it's faster (for large dataframes), less error prone, and leads to cleaner code. [Here is an article about it :)](https://engineering.upside.com/a-beginners-guide-to-optimizing-pandas-code-for-speed-c09ef2c6a4d6) – sacuL Nov 27 '18 at 15:49
  • 1
    @ sacul Thank you for linking that I'll check it out, and thank you for answering the loop question as well! – tom91 Nov 27 '18 at 15:59