Suppose I have some dataframes of the form
id time value
0 100 2012-04-01 2.3
1 100 2012-04-02 3.4
2 100 2012-04-03 1.4
3 100 2012-04-04 5.6
4 200 2012-04-01 NaN
5 200 2012-04-02 2.9
6 200 2012-04-03 2.8
7 200 2012-04-04 1.9
8 300 2012-04-01 3.1
9 300 2012-04-02 NaN
10 300 2012-04-03 2.5
11 300 2012-04-04 1.2
So we have the id of the timeseries (here 100,200,300) and for each time series we have the values corresponding to 4 different timesteps ( 2012-04-01 to 2012-04-04).
I want to create a function that takes such a dataframe, pivots it and then sums over the rows. However, I want this function to have a parameter that decides how to sum in the case one of the values is a nan value. I have 3 cases:
- nan is considered as 0 (e.g a + b + NaN = a+b)
- if nan is in the sum, the whole sum is zero (e.g. a + b + NaN = 0)
- if nan is in the sum, the whole sum is nan (e.g. a + b + NaN = Nan)
My Try
First of all, looking here I know I can pivot the df using df.pivot('time', 'id', 'value')
obtaining
time 100 200 300
2012-04-01 2.3 3.0 3.1
2012-04-02 3.4 2.9 4.2
2012-04-03 1.4 2.8 2.5
2012-04-04 5.6 1.9 1.2
Then the function could look like
def pivot_sum(df, method = 'zero'):
# NaN are treated as zero: a + b + NaN = a + b
if method == 'zero':
return pd.DataFrame(df.groupby('time')['value'].sum())
else:
# First we pivot it
pivoted = df.pivot('time', 'id', 'value')
# Then we choose
if method == 'nan_make_zero_sum':
# Do something
elif method == 'nan_make_nan_sum':
# DO something else
I am quite stuck on how to proceed. I noticed that using the groupby option behaves as if nan values are treated as zeros. However, I don't know how to implement the others.
Note: I noticed that the new version of pandas has got a parameter min_count
which could be useful (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sum.html ) , however I have version 0.21 and due to other problems, changing the version is not a good solution.
Edit
I manage to write it so that it works for pandas version 0.22, however I would like to know how one would write a function like that for previous versions. Here it is anyway
def sum_values(df, method = 'skip'):
if method == 'skip':
# Then we treat NaN values as zeros
return pd.DataFrame(df.groupby('time')['value'].sum())
else:
# First we pivot the df
df_pivoted = df.pivot('time', 'id', 'value')
# Then we check
if method == 'nan_sum':
return pd.DataFrame(df_pivoted.sum(axis = 1, min_count = df_pivoted.shape[1])).rename(columns = {0:'value'})
elif method == 'zero_sum':
# do the same as above
return pd.DataFrame(df_pivoted.sum(axis = 1, min_count = df_pivoted.shape[1])).rename(columns = {0:'value'}).fillna(0)
Edit 2 - What it looks like
Given the function above, this is what the process looks like.
I have a dataframe
id time value
0 100 2012-04-01 2.3
1 100 2012-04-02 3.4
2 100 2012-04-03 1.4
3 100 2012-04-04 5.6
4 200 2012-04-01 NaN
5 200 2012-04-02 2.9
6 200 2012-04-03 2.8
7 200 2012-04-04 1.9
8 300 2012-04-01 3.1
9 300 2012-04-02 NaN
10 300 2012-04-03 2.5
11 300 2012-04-04 1.2
and applying the function above
sum_values(df, method = 'skip')
returns
value
time
2012-04-01 5.4
2012-04-02 6.3
2012-04-03 6.7
2012-04-04 8.7
sum_values(df, method = 'nan_sum')
returns
value
time
2012-04-01 NaN
2012-04-02 NaN
2012-04-03 6.7
2012-04-04 8.7
sum_values(df, method = 'zero_sum')
returns
value
time
2012-04-01 0.0
2012-04-02 0.0
2012-04-03 6.7
2012-04-04 8.7