1

Suppose I have some dataframes of the form

     id        time  value
0   100  2012-04-01    2.3
1   100  2012-04-02    3.4
2   100  2012-04-03    1.4
3   100  2012-04-04    5.6
4   200  2012-04-01    NaN
5   200  2012-04-02    2.9
6   200  2012-04-03    2.8
7   200  2012-04-04    1.9
8   300  2012-04-01    3.1
9   300  2012-04-02    NaN
10  300  2012-04-03    2.5
11  300  2012-04-04    1.2

So we have the id of the timeseries (here 100,200,300) and for each time series we have the values corresponding to 4 different timesteps ( 2012-04-01 to 2012-04-04).

I want to create a function that takes such a dataframe, pivots it and then sums over the rows. However, I want this function to have a parameter that decides how to sum in the case one of the values is a nan value. I have 3 cases:

  • nan is considered as 0 (e.g a + b + NaN = a+b)
  • if nan is in the sum, the whole sum is zero (e.g. a + b + NaN = 0)
  • if nan is in the sum, the whole sum is nan (e.g. a + b + NaN = Nan)

My Try

First of all, looking here I know I can pivot the df using df.pivot('time', 'id', 'value') obtaining

    time      100    200    300
2012-04-01    2.3    3.0    3.1
2012-04-02    3.4    2.9    4.2
2012-04-03    1.4    2.8    2.5
2012-04-04    5.6    1.9    1.2

Then the function could look like

def pivot_sum(df, method = 'zero'):
    # NaN are treated as zero: a + b + NaN = a + b
    if method == 'zero':
        return pd.DataFrame(df.groupby('time')['value'].sum())
    else:
        # First we pivot it
        pivoted = df.pivot('time', 'id', 'value')
        # Then we choose
        if method == 'nan_make_zero_sum':
            # Do something
        elif method == 'nan_make_nan_sum':
            # DO something else

I am quite stuck on how to proceed. I noticed that using the groupby option behaves as if nan values are treated as zeros. However, I don't know how to implement the others.

Note: I noticed that the new version of pandas has got a parameter min_count which could be useful (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html ) , however I have version 0.21 and due to other problems, changing the version is not a good solution.

Edit

I manage to write it so that it works for pandas version 0.22, however I would like to know how one would write a function like that for previous versions. Here it is anyway

def sum_values(df, method = 'skip'):
    if method == 'skip':
        # Then we treat NaN values as zeros
        return pd.DataFrame(df.groupby('time')['value'].sum())

    else:
        # First we pivot the df
        df_pivoted = df.pivot('time', 'id', 'value')
        # Then we check
        if method == 'nan_sum':
            return pd.DataFrame(df_pivoted.sum(axis = 1, min_count = df_pivoted.shape[1])).rename(columns = {0:'value'})
        elif method == 'zero_sum':
            # do the same as above
            return pd.DataFrame(df_pivoted.sum(axis = 1, min_count = df_pivoted.shape[1])).rename(columns = {0:'value'}).fillna(0)

Edit 2 - What it looks like

Given the function above, this is what the process looks like.

I have a dataframe

     id        time  value
0   100  2012-04-01    2.3
1   100  2012-04-02    3.4
2   100  2012-04-03    1.4
3   100  2012-04-04    5.6
4   200  2012-04-01    NaN
5   200  2012-04-02    2.9
6   200  2012-04-03    2.8
7   200  2012-04-04    1.9
8   300  2012-04-01    3.1
9   300  2012-04-02    NaN
10  300  2012-04-03    2.5
11  300  2012-04-04    1.2

and applying the function above

sum_values(df, method = 'skip') returns

            value
   time        
2012-04-01    5.4
2012-04-02    6.3
2012-04-03    6.7
2012-04-04    8.7

sum_values(df, method = 'nan_sum') returns

            value
   time        
2012-04-01    NaN
2012-04-02    NaN
2012-04-03    6.7
2012-04-04    8.7

sum_values(df, method = 'zero_sum') returns

            value
   time        
2012-04-01    0.0
2012-04-02    0.0
2012-04-03    6.7
2012-04-04    8.7
Euler_Salter
  • 3,271
  • 8
  • 33
  • 74

1 Answers1

1

I have created below 3 functions which would give the desired output:

Function 1

import numpy as np
def sum_val(df):
    df1=df.copy()
    df1['value'] = pd.to_numeric(df1['value'], errors='coerce')
    df2=df1.fillna(0.0)
    df2['sum_nan']=df2.groupby(['time_val']).transform('sum')
    df3=df2.drop('value', inplace=False, axis=1)
    df4=df3.drop_duplicates()
    return (df4)
sum_val(df)

Output:

     time_val  sum_nan
0  2012-04-01      5.4
1  2012-04-02      6.3
2  2012-04-03      6.7
3  2012-04-04      8.7

Function 2

def sum_nan(df):

    df1=pd.DataFrame()
    df1=df.copy()

    df1['sum_nan']=df1.groupby(['time_val']).transform('sum')

    df2=df1.drop('value', inplace=False, axis=1)
    df2=df2.drop_duplicates()
    return df2



 sum_nan(df)

Output:

time_val    sum_nan
0   2012-04-01  NaN
1   2012-04-02  NaN
2   2012-04-03  6.7
3   2012-04-04  8.7

Function 3

def sum_0(df):

    df1=pd.DataFrame()
    df1=df.copy()
    df1['sum_nan']=df1.groupby(['time_val']).transform('sum')
    df2=df1.drop('value', inplace=False, axis=1)
    df2=df2.replace(np.nan,0)
    df3=df2.drop_duplicates()
    return(df3)

sum_0(df)

Output:

     time_val  sum_nan
0  2012-04-01      0.0
1  2012-04-02      0.0
2  2012-04-03      6.7
3  2012-04-04      8.7
Pang
  • 9,564
  • 146
  • 81
  • 122
user15051990
  • 1,835
  • 2
  • 28
  • 42