0

I have the DataFrame:

df = np.DataFrame = {'Year' : [2010, 2011, 2012, 2013, 1922, 1923, 1924, 1925],  
                     'ID' : ['A', 'A', 'A', 'A', 'B', 'B', 'B'],
                     'Data1' : [1, 2, 3, 4, 2, 3, 4], 
                     'Data2' : [2, 2, 2, 2, 3, 3, 3]}

       df
Year ID Data1 Data2
2010  A   1     2
2011  A   2     2
2012  A   3     2
2013  A   4     2
1922  B   2     3
1923  B   3     3
1924  B   4     3

I would like to sum up total Data1 and Data2 up for each ID so that each year gives the sum of all Data1 or Data2 up to that year.

Ultimately, I would like my DataFrame to look like this:

    df
Year ID Data1 Data2
2010 A    1     2
2011 A    3     4
2012 A    6     6
2013 A    10    8
1922 B    2     3
1923 B    5     6
1924 B    9     9

I have over 100k rows, so this needs to be done without iteration if possible.

Jcb Rb
  • 61
  • 7

1 Answers1

1

use assign and groupby

df = df.assign(**df.groupby('ID')['Data1', "Data2"].cumsum())
print(df)

    Year   ID  Data1    Data2
0   2010    A    1       2
1   2011    A    3       4
2   2012    A    6       6
3   2013    A    10      8
4   1922    B    2       3
5   1923    B    5       6
6   1924    B    9       9
Abhi
  • 4,068
  • 1
  • 16
  • 29
  • 1
    Perfect, thanks. What do the double asterisks imply? Also, how would I do this if the years were not in order? – Jcb Rb Aug 30 '18 at 16:48
  • `**kwargs` is used to allow multiple number of arguments to a function. In this case for `assign`. [read more](https://stackoverflow.com/questions/36901/what-does-double-star-asterisk-and-star-asterisk-do-for-parameters) – Abhi Aug 30 '18 at 16:51