1

I'd like to calculate the following sum(size):

sample data

The sum(size) is a sum of col size, and for each col i'd to sum only up the last non-0/NaN row. Is this possible in python?

I've tried to add df.apply(pd.Series.last_valid_index) to identify the last index for each col, but i don't know how to incorporate into the over query to get the sum for each col.

Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52
sarahf
  • 11
  • 2
  • If you are using Pandas, you should add the Pandas tag to your question. – Jarad Mar 08 '23 at 22:57
  • 1
    Welcome to Stack Overflow! Check out the [tour]. [Please don't post pictures of data](https://meta.stackoverflow.com/q/285551/4518341). Instead, copy the text itself, [edit] it into your post, and use the formatting tools like [code formatting](/editing-help#code). For more tips, check out [How to ask a good question](/help/how-to-ask), [mre], and [How to make good reproducible pandas examples](/q/20109391/4518341). – wjandrea Mar 08 '23 at 23:12
  • Refrain from showing your dataframe as an image. Your question needs a minimal reproducible example consisting of sample input, expected output, actual output, and only the relevant code necessary to reproduce the problem. See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) for best practices related to Pandas questions. – itprorh66 Mar 09 '23 at 00:29

3 Answers3

2

You can try something like this:

import pandas as pd
import numpy as np

df = pd.DataFrame([[5,5,4,3],[7,6,6,np.nan],[4,np.nan,4,2],[3,3,np.nan,np.nan],[6,np.nan,np.nan, np.nan]])
s=df.apply(pd.Series.last_valid_index)
df.loc['Sum',:] = [df.iloc[:i+1, 0].sum() for i in s]

print(df)

Output:

        0     1     2     3
0     5.0   5.0   4.0   3.0
1     7.0   6.0   6.0   NaN
2     4.0   NaN   4.0   2.0
3     3.0   3.0   NaN   NaN
4     6.0   NaN   NaN   NaN
Sum  25.0  19.0  16.0  16.0
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 1
    Thanks Scott, this is fantastic! I have tried months trying to figure out this on my own with my very limited Python knowledge. Many thanks for your help! – sarahf Mar 09 '23 at 03:42
  • @sarahf Happy coding. There are probably other ways. I think this solution will give others better ideas on how to solve your problem. – Scott Boston Mar 09 '23 at 03:45
  • 1
    Hi Scott, one last question. Is it possible to keep the month col in the table? I get an error msg "cannot perform sum with type period[M]" if i dont drop this col. Thanks! – sarahf Mar 09 '23 at 16:15
  • I would move the month into the dataframe index using `df = df.set_index('month')` then do the above statements and afterward `df = df.reset_index()` to get month back into the dataframe columns. – Scott Boston Mar 09 '23 at 16:21
  • I actually tried that, but got error msg. I have a feeling that by changing the index it kind of mess up the last_valid_index and it's following calculation. I did get to keep the Month col by manipulating your coding to s=df.drop(columns=[0]).apply(pd.Series.last_valid_index) df.loc['Sum',1:] = [df.iloc[:i+1, 1].sum() for i in s]. I'd still love to see how it would work with set_index if it's not too much work. Many thanks. – sarahf Mar 09 '23 at 16:44
1

Here is a vectorial approach:

s = (
 df.filter(like='m')
   .notna()[::-1].cummax()
   .mul(df['size'], axis=0)
   .sum()
 )

s['size'] = df['size'].sum()

df.loc['sum'] = s

print(df)

Output:

     size    m1    m2    m3
0       5   5.0   4.0   3.0
1       7   6.0   6.0   NaN
2       4   NaN   4.0   2.0
3       3   3.0   NaN   NaN
4       6   NaN   NaN   NaN
sum    25  19.0  16.0  16.0

Intermediates:

(df.filter(like='m')
   .notna()[::-1].cummax()
 )

      m1     m2     m3
0   True   True   True
1   True   True   True
2   True   True   True
3   True  False  False
4  False  False  False


(df.filter(like='m')
   .notna()[::-1].cummax()
   .mul(df['size'], axis=0)
 )

   m1  m2  m3
0   5   5   5
1   7   7   7
2   4   4   4
3   3   0   0
4   0   0   0
mozway
  • 194,879
  • 13
  • 39
  • 75
0

Once your df is cleaned you will not have empty cells, they most likely will be filled with 'nan' or zeros. Also in your picture you have an 'addition' error. If you add the numbers from the table the count will not match the count in that picture. This will add a column at the end of your df that will contain a count:

import pandas as pd

df = pd.DataFrame({'size':[5,7,4,3,6], 'm1': [5,6, 0, 3, 0], 'm2': [4,6,4,0,0], 'm3': [3, 0, 2, 0, 0]})
df1 = pd.DataFrame({'size':[df['size'].sum()], 'm1':[df['m1'].sum()], 'm2':[df['m2'].sum()], 'm3':[df['m3'].sum()]})

df = pd.concat([df, df1], ignore_index=True)

old_index = df.index.to_list()
df.rename(index={old_index[-1]:'Sum'}, inplace=True)
print(df)

OUTPUT:

     size  m1  m2  m3
0       5   5   4   3
1       7   6   6   0
2       4   0   4   2
3       3   3   0   0
4       6   0   0   0
Sum    25  14  14   5

I did not reproduce your index as it was irrelevant.

I personally would put it in a different DF instead of creating a row, but that is a personal choice.

EDITED

I am kinda confuse if you want to calculate upto first or last non NaN value, so here is a few options.

if you want to calculate upto first non nan value then do this:

import pandas as pd

import numpy as np

df = pd.DataFrame({'size':[5,7,4,3,6], 'm1': [5,6, np.nan, 3, np.nan], 'm2': [4,6,4,np.nan,np.nan], 'm3': [3, np.nan, 2, np.nan, np.nan]})

df.dropna(inplace=True)

df1 = pd.DataFrame({'size':[df['size'].sum()], 'm1':[df['m1'].sum()], 'm2':[df['m2'].sum()], 'm3':[df['m3'].sum()]})

df = pd.concat([df, df1], ignore_index=True)

old_index = df.index.to_list()
df.rename(index={old_index[-1]:'Sum'}, inplace=True)
print(df)

OUTPUT:

     size   m1   m2   m3
0       5  5.0  4.0  3.0
Sum     5  5.0  4.0  3.0

That is assuming you cleaned data.

OR

If you want to keep all data but still count only up to first not non value:

import pandas as pd
import numpy as np

df = pd.DataFrame({'size':[5,7,4,3,6], 'm1': [5, 6, np.nan, 3, np.nan], 'm2': [4,6,4,np.nan,np.nan], 'm3': [3, np.nan, 2, np.nan, np.nan]})

df2 = df.copy()

df2.dropna(inplace=True)

df1 = pd.DataFrame({'size':[df2['size'].sum()], 'm1':[df2['m1'].sum()], 'm2':[df2['m2'].sum()], 'm3':[df2['m3'].sum()]})
df = pd.concat([df, df1], ignore_index=True)

old_index = df.index.to_list()
df.rename(index={old_index[-1]:'Sum'}, inplace=True)
print(df)

OUTPUT:

     size   m1   m2   m3
0       5  5.0  4.0  3.0
1       7  6.0  6.0  NaN
2       4  NaN  4.0  2.0
3       3  3.0  NaN  NaN
4       6  NaN  NaN  NaN
Sum     5  5.0  4.0  3.0

Though I still don't get how you got your count.

Dennis
  • 78
  • 7
  • Thanks Dennis for your reply. I think I confused people with the sum placed at the end of the table like that. But I really can't think of a proper way to show it. What im looking for are: the sum of m1=5+7+4+3=19; sum of m2=5+7+4=16, sum of m3=5+7+4=16. The sum can be shown as a separate table/list. – sarahf Mar 09 '23 at 02:20
  • I am sorry I don't understand the 'logic' of how you want to combine numbers. If you can provide more info, I'd love to help you. – Dennis Mar 09 '23 at 02:39
  • For col m1, the last non 0 value is 3, whitch is at month 2018-11, so I would want sum of the Size values from 2018-7 to 2018-11, therefore 5+7+4+3=19. For m2, the last non 0 value is at month 2018-10, so i would want the sum of Size col from Month 2018-07 to 2018-10, therefore 5+7+4=16. I hope this makes sense now. Thanks. – sarahf Mar 09 '23 at 03:38