
I have four columns in a data frame like so:

       A   B     C        D
75472  d1  x    -36.0   0.0
75555  d2  x    -38.0   0.0
75638  d3  x    -18.0   0.0
75721  d4  x    -18.0   1836.0
75804  d5  x    1151.0  0.0
75887  d6  x    734.0   0.0
75970  d7  x    -723.0  0.0

And I want to conditionally sum D by:

  • If D has a value (nonzero), take D
  • Else, take the previous row's computed result plus C

So for above, D would be [-36, -74, -92, 1836, 2987, 3721, 2998].
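For reference, that rule can be reproduced with a plain Python loop (a minimal sketch: the frame is rebuilt from the sample data above, and the carry starts at 0 for the first row):

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['d1', 'd2', 'd3', 'd4', 'd5', 'd6', 'd7'],
    'B': ['x'] * 7,
    'C': [-36.0, -38.0, -18.0, -18.0, 1151.0, 734.0, -723.0],
    'D': [0.0, 0.0, 0.0, 1836.0, 0.0, 0.0, 0.0],
})

# Apply the rule row by row: take D when it is nonzero,
# otherwise carry the previous result forward and add C.
result = []
for c, d in zip(df['C'], df['D']):
    if d != 0:
        result.append(d)
    else:
        prev = result[-1] if result else 0.0
        result.append(prev + c)

print(result)  # [-36.0, -74.0, -92.0, 1836.0, 2987.0, 3721.0, 2998.0]
```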

I've been able to do this successfully with a for loop

for i, row in me.iterrows():
    try:
        if row['D'] > 0:
            # D has a value: take it directly
            step1 = me.loc[(me['B'] == row['B']) & (me['A'] == row['A']), 'output'].iloc[0]
            me_copy.iloc[i, me_copy.columns.get_loc('output')] = step1
        else:
            # Otherwise: previous day's running value plus today's C
            step1 = me.loc[(me['B'] == row['B']) & (me['A'] == (row['A'] - pd.DateOffset(days=1))), 'step1'].iloc[0]
            receipts_adjustments_sales = me.loc[(me['B'] == row['B']) & (me['A'] == row['A']), 'C'].iloc[0]
            me_copy.iloc[i, me_copy.columns.get_loc('output')] = step1 + receipts_adjustments_sales
    except IndexError:
        # No previous row found (e.g. the first day of a group)
        me_copy.iloc[i, me_copy.columns.get_loc('output')] = 0

But the for loop is really expensive, an anti-pattern, and it doesn't finish over my whole data frame. I'm trying to replicate an Excel formula that is effectively written over a panel of data, and for the life of me I cannot figure out how to do this with:

  • pd.Series.shift()
  • pd.Series.rolling()
  • Simply calculating different column values

I was attempting to do it with shift() for a while, but I realized that I kept needing a separate column for each step back, and that's why I went with a for loop.
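For illustration (a hypothetical attempt using the sample data): a single shift() only reaches the previous *input* row, never the previous *computed* row, so the recursive carry-forward cannot be expressed with one shift:

```python
import pandas as pd

df = pd.DataFrame({
    'C': [-36.0, -38.0, -18.0, -18.0, 1151.0, 734.0, -723.0],
    'D': [0.0, 0.0, 0.0, 1836.0, 0.0, 0.0, 0.0],
})

# This only adds the previous row's raw D, not the running total,
# so already the second row is wrong (-38.0 instead of -74.0).
attempt = df['D'].shift(fill_value=0) + df['C']
print(attempt.tolist())  # [-36.0, -38.0, -18.0, -18.0, 2987.0, 734.0, -723.0]
```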

Generalized to Groups

# Restart the group counter within each B panel
df.loc[:, 'A_group'] = (df['D'] != 0).groupby(df['B']).cumsum()
df.loc[:, 'E'] = df['D'].mask(df['D'] == 0).combine_first(df['C'])
df.loc[:, 'F'] = df.groupby(['B', 'A_group'])['E'].cumsum()

Thanks to Scott Boston for the help!
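As a sanity check of the grouped version (a sketch with a second, hypothetical panel 'y' added to B; the counter and the cumulative sum both restart per panel):

```python
import pandas as pd

df = pd.DataFrame({
    'B': ['x', 'x', 'x', 'y', 'y', 'y'],
    'C': [-36.0, -38.0, -18.0, 10.0, 20.0, 30.0],
    'D': [0.0, 0.0, 5.0, 0.0, 100.0, 0.0],
})

# Group counter restarts within each B panel at every nonzero D
grp = (df['D'] != 0).groupby(df['B']).cumsum()
# Take D where nonzero, otherwise C
e = df['D'].mask(df['D'] == 0).combine_first(df['C'])
# Running sum within each (panel, segment) pair
out = e.groupby([df['B'], grp]).cumsum()
print(out.tolist())  # [-36.0, -74.0, 5.0, 10.0, 100.0, 130.0]
```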

gr1zzly be4r

2 Answers


Here is a way to do it:

grp = (df['D'] != 0).cumsum()
df['D_new'] = df['D'].mask(df['D'] == 0).combine_first(df['C']).groupby(grp).cumsum()
df

Output:

        A  B       C       D   D_new
75472  d1  x   -36.0     0.0   -36.0
75555  d2  x   -38.0     0.0   -74.0
75638  d3  x   -18.0     0.0   -92.0
75721  d4  x   -18.0  1836.0  1836.0
75804  d5  x  1151.0     0.0  2987.0
75887  d6  x   734.0     0.0  3721.0
75970  d7  x  -723.0     0.0  2998.0

Details:

Create `grp` to drive the cumsum. Each group starts where a nonzero value appears in 'D': the running sum stops just before it, restarts from that value of 'D', and continues until the next nonzero value of 'D'.

grp = (df['D'] != 0).cumsum()

Output:

        A  B       C       D   D_new  grp
75472  d1  x   -36.0     0.0   -36.0    0
75555  d2  x   -38.0     0.0   -74.0    0
75638  d3  x   -18.0     0.0   -92.0    0
75721  d4  x   -18.0  1836.0  1836.0    1
75804  d5  x  1151.0     0.0  2987.0    1
75887  d6  x   734.0     0.0  3721.0    1
75970  d7  x  -723.0     0.0  2998.0    1

Now, let's create a new column combining 'C' and 'D': take 'D' where it is nonzero, otherwise 'C'.

df['newCD'] = df['D'].mask(df['D'] == 0).combine_first(df['C'])

Output:

        A  B       C       D   D_new  grp   newCD
75472  d1  x   -36.0     0.0   -36.0    0   -36.0
75555  d2  x   -38.0     0.0   -74.0    0   -38.0
75638  d3  x   -18.0     0.0   -92.0    0   -18.0
75721  d4  x   -18.0  1836.0  1836.0    1  1836.0
75804  d5  x  1151.0     0.0  2987.0    1  1151.0
75887  d6  x   734.0     0.0  3721.0    1   734.0
75970  d7  x  -723.0     0.0  2998.0    1  -723.0

And, lastly, group by 'grp' and cumsum 'newCD':

df['D_new_Details'] = df.groupby('grp')['newCD'].cumsum()

Output:

        A  B       C       D   D_new  grp   newCD  D_new_Details
75472  d1  x   -36.0     0.0   -36.0    0   -36.0          -36.0
75555  d2  x   -38.0     0.0   -74.0    0   -38.0          -74.0
75638  d3  x   -18.0     0.0   -92.0    0   -18.0          -92.0
75721  d4  x   -18.0  1836.0  1836.0    1  1836.0         1836.0
75804  d5  x  1151.0     0.0  2987.0    1  1151.0         2987.0
75887  d6  x   734.0     0.0  3721.0    1   734.0         3721.0
75970  d7  x  -723.0     0.0  2998.0    1  -723.0         2998.0
Scott Boston
  • This works, thank you! Is there a way to generalize it to a group of A, B? Also - is there good docs on this? Some of these one liners in Pandas are like wizardry to me even after reading the documentation. – gr1zzly be4r Jul 16 '20 at 13:17
  • @gr1zzlybe4r Nope, I had a very [similar problem to this using SQL](https://stackoverflow.com/q/47796227/6361531) a while back. I use that SQL logic to write a pandas solution and added that algorithm to my toolbox. – Scott Boston Jul 16 '20 at 13:19
  • @gr1zzlybe4r I added some details to the solution hope it helps. – Scott Boston Jul 16 '20 at 13:20
  • No as in no way to generalize it to a group of A, B? Your explanation is very helpful, thank you! – gr1zzly be4r Jul 16 '20 at 13:23
  • @gr1zzlybe4r Oh.. I am sure we can add A and B to the groupings. I was stating no to an documentation on this particular algorithm. At least, I don't know of any. You just need to modify the `grp` statement to create your desired groups with A and B. – Scott Boston Jul 16 '20 at 13:25
  • 1
    Yeah so to add A and B I think it's `grp = df.groupby(['A', 'B', df['D'] > 0]).cumsum() df['step1'].mask(df['step1'] == 0).combine_first(df['receipts_adjustments_sales']).groupby(grp).cumsum()` ? – gr1zzly be4r Jul 16 '20 at 13:28
  • 1
    Grouping is a little bit tougher, but I will get it and post it in my question once I figure it out – gr1zzly be4r Jul 16 '20 at 13:59
  • Updated my question with the group generalization – gr1zzly be4r Jul 16 '20 at 14:28

Another, similar to Scott's answer:

groups = df['D'].ne(0).cumsum()
df['new'] = (df['C'].where(df['D'].eq(0), df['D'])
     .groupby(groups)
     .cumsum()
)
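A quick check (assuming the same C/D columns from the question) that the `where`/`eq` construction selects D when it is nonzero and C otherwise, reproducing the expected running totals:

```python
import pandas as pd

df = pd.DataFrame({
    'C': [-36.0, -38.0, -18.0, -18.0, 1151.0, 734.0, -723.0],
    'D': [0.0, 0.0, 0.0, 1836.0, 0.0, 0.0, 0.0],
})

# where(cond, other): keep C where D == 0, substitute D elsewhere,
# then restart the cumulative sum at every nonzero D.
groups = df['D'].ne(0).cumsum()
new = df['C'].where(df['D'].eq(0), df['D']).groupby(groups).cumsum()
print(new.tolist())  # [-36.0, -74.0, -92.0, 1836.0, 2987.0, 3721.0, 2998.0]
```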
Quang Hoang