
I have an unevenly distributed dataframe, such as

2013-05-16 17:33:30  485.75     NaN     NaN
2013-05-16 17:34:00  479.16     NaN     NaN
2013-05-16 17:35:30     NaN  429.90     NaN
2013-05-16 17:36:00     NaN  433.39     NaN
2013-05-16 17:37:30     NaN     NaN  415.94
2013-05-16 17:38:00     NaN     NaN  401.59
2013-05-16 17:49:30  432.23     NaN     NaN
2013-05-16 17:51:00     NaN  424.08     NaN
2013-05-16 17:52:30     NaN     NaN  411.67
2013-05-16 18:01:30  471.01     NaN     NaN
2013-05-16 18:02:00  474.11     NaN     NaN
2013-05-16 18:03:30     NaN  440.76     NaN
2013-05-16 18:04:00     NaN  438.82     NaN
2013-05-16 18:17:30  469.46     NaN     NaN
2013-05-16 18:18:00  460.93     NaN     NaN

I can treat each column separately. For each column I can have one, two, three, or even four consecutive values surrounded by NaNs. What I want to do is take two consecutive rows of values at a time and replace them with a single row whose value is the mean of the two values and whose index is the mean of the two timestamps. So the example above would become

2013-05-16 17:33:45  482.455      NaN      NaN
2013-05-16 17:35:45      NaN  431.645      NaN
2013-05-16 17:37:45      NaN      NaN  408.765
2013-05-16 17:49:30  432.230      NaN      NaN
2013-05-16 17:51:00      NaN  424.080      NaN
2013-05-16 17:52:30      NaN      NaN  411.670
2013-05-16 18:01:45  472.560      NaN      NaN
2013-05-16 18:03:45      NaN  439.790      NaN
2013-05-16 18:17:45  465.195      NaN      NaN

So the consecutive values are averaged, and the rows with only one value are left alone. I have tried things like `df.resample('30s').resample('2min')` and `(df + df.shift(1))/2`, but so far no luck. Any ideas?

Note: in each row, only one column has a value; the other columns are always NaN.
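For reference, here is a minimal construction of the frame above (the column names a, b and c are placeholders, chosen to match the answer below):

import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2013-05-16 17:33:30', '2013-05-16 17:34:00',
                        '2013-05-16 17:35:30', '2013-05-16 17:36:00',
                        '2013-05-16 17:37:30', '2013-05-16 17:38:00',
                        '2013-05-16 17:49:30', '2013-05-16 17:51:00',
                        '2013-05-16 17:52:30', '2013-05-16 18:01:30',
                        '2013-05-16 18:02:00', '2013-05-16 18:03:30',
                        '2013-05-16 18:04:00', '2013-05-16 18:17:30',
                        '2013-05-16 18:18:00'])
vals = [485.75, 479.16, 429.90, 433.39, 415.94, 401.59, 432.23, 424.08,
        411.67, 471.01, 474.11, 440.76, 438.82, 469.46, 460.93]
cols = ['a', 'a', 'b', 'b', 'c', 'c', 'a', 'b', 'c', 'a', 'a', 'b', 'b', 'a', 'a']
#start from an all-NaN frame and fill in the single value of each row
df = pd.DataFrame(np.nan, index=idx, columns=['a', 'b', 'c'])
for t, c, v in zip(idx, cols, vals):
    df.loc[t, c] = v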

TomCho

1 Answer


You can first convert the DatetimeIndex to Unix time and move it into a regular column, so the timestamps can be averaged along with the values. Then group each column into runs of consecutive non-NaN values and take the mean of each group. Last, dropna the all-NaN rows and convert the Unix times back to datetime with to_datetime:

print(df)
                          a       b       c
2013-05-16 17:33:30  485.75     NaN     NaN
2013-05-16 17:34:00  479.16     NaN     NaN
2013-05-16 17:35:30     NaN  429.90     NaN
2013-05-16 17:36:00     NaN  433.39     NaN
2013-05-16 17:37:30     NaN     NaN  415.94
2013-05-16 17:38:00     NaN     NaN  401.59
2013-05-16 17:49:30  432.23     NaN     NaN
2013-05-16 17:51:00     NaN  424.08     NaN
2013-05-16 17:52:30     NaN     NaN  411.67
2013-05-16 18:01:30  471.01     NaN     NaN
2013-05-16 18:02:00  474.11     NaN     NaN
2013-05-16 18:03:30     NaN  440.76     NaN
2013-05-16 18:04:00     NaN  438.82     NaN
2013-05-16 18:17:30  469.46     NaN     NaN
2013-05-16 18:18:00  460.93     NaN     NaN
import numpy as np
import pandas as pd

#convert DatetimeIndex (nanoseconds) to Unix time in seconds
#(mean needs numbers, not datetimes)
df.index = df.index.astype(np.int64) // 10**9
#move the index into an ordinary column named 'index'
df = df.reset_index()
print(df)
         index       a       b       c
0   1368725610  485.75     NaN     NaN
1   1368725640  479.16     NaN     NaN
2   1368725730     NaN  429.90     NaN
3   1368725760     NaN  433.39     NaN
4   1368725850     NaN     NaN  415.94
5   1368725880     NaN     NaN  401.59
6   1368726570  432.23     NaN     NaN
7   1368726660     NaN  424.08     NaN
8   1368726750     NaN     NaN  411.67
9   1368727290  471.01     NaN     NaN
10  1368727320  474.11     NaN     NaN
11  1368727410     NaN  440.76     NaN
12  1368727440     NaN  438.82     NaN
13  1368728250  469.46     NaN     NaN
14  1368728280  460.93     NaN     NaN
df = pd.concat([df.groupby(df.a.isnull().diff().cumsum().fillna(0)).mean().set_index('index')[['a']],
                df.groupby(df.b.isnull().diff().cumsum().fillna(0)).mean().set_index('index')[['b']],
                df.groupby(df.c.isnull().diff().cumsum().fillna(0)).mean().set_index('index')[['c']]], axis=1)

#drop rows with all NaN, remove index name (new in 0.18)
df = df.dropna(how='all').rename_axis(None)  
#convert unix time to datetime
df.index = pd.to_datetime(df.index, unit='s')
print(df)
                           a        b        c
2013-05-16 17:33:45  482.455      NaN      NaN
2013-05-16 17:35:45      NaN  431.645      NaN
2013-05-16 17:37:45      NaN      NaN  408.765
2013-05-16 17:49:30  432.230      NaN      NaN
2013-05-16 17:51:00      NaN  424.080      NaN
2013-05-16 17:52:30      NaN      NaN  411.670
2013-05-16 18:01:45  472.560      NaN      NaN
2013-05-16 18:03:45      NaN  439.790      NaN
2013-05-16 18:17:45  465.195      NaN      NaN 

Explanations:

First you need to create group labels from the runs of consecutive rows where the column contains numbers (or NaNs). You need fillna with 0 because diff always returns NaN for the first element, which would otherwise leave a NaN group label after cumsum. In this sample the data starts with values in column a, but in real data the same applies to columns b and c too.

df1 = pd.DataFrame( {'isnull': df.a.isnull()})
df1['diff'] = df1['isnull'].diff()
df1['cumsum'] = df1['diff'].cumsum().fillna(0)
print(df1)
   isnull   diff  cumsum
0   False    NaN     0.0
1   False  False     0.0
2    True   True     1.0
3    True  False     1.0
4    True  False     1.0
5    True  False     1.0
6   False   True     2.0
7    True   True     3.0
8    True  False     3.0
9   False   True     4.0
10  False  False     4.0
11   True   True     5.0
12   True  False     5.0
13  False   True     6.0
14  False  False     6.0
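As an aside (not in the original answer): the same partition into runs can be built without subtracting booleans, by comparing each flag with its shifted neighbour. The labels start at 1 instead of 0, but the grouping is identical:

#equivalent run labels: 1,1,2,2,2,2,3,4,4,5,5,6,6,7,7
df1['group'] = df1['isnull'].ne(df1['isnull'].shift()).cumsum()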

Then you can group by these labels and aggregate with mean. Because the datetime information now lives in the index column, it is averaged too. set_index then restores the averaged Unix times as the index, and only the relevant column (a, b or c) is kept, so the three aggregated DataFrames can be concatenated along this new index.

print(df.groupby(df.a.isnull().diff().cumsum().fillna(0)).mean())
          index        a        b        c
a                                         
0.0  1368725625  482.455      NaN      NaN
1.0  1368725805      NaN  431.645  408.765
2.0  1368726570  432.230      NaN      NaN
3.0  1368726705      NaN  424.080  411.670
4.0  1368727305  472.560      NaN      NaN
5.0  1368727425      NaN  439.790      NaN
6.0  1368728265  465.195      NaN      NaN
print(df.groupby(df.a.isnull().diff().cumsum().fillna(0)).mean().set_index('index'))
                   a        b        c
index                                 
1368725625  482.455      NaN      NaN
1368725805      NaN  431.645  408.765
1368726570  432.230      NaN      NaN
1368726705      NaN  424.080  411.670
1368727305  472.560      NaN      NaN
1368727425      NaN  439.790      NaN
1368728265  465.195      NaN      NaN
print(df.groupby(df.a.isnull().diff().cumsum().fillna(0)).mean().set_index('index')[['a']])
                  a
index              
1368725625  482.455
1368725805      NaN
1368726570  432.230
1368726705      NaN
1368727305  472.560
1368727425      NaN
1368728265  465.195

If you need a more automatic approach, loop over all the value columns:

#convert to unix time (need integers from datetime for mean)
df.index = df.index.astype(np.int64) // 10**9
#create column index from df.index
df = df.reset_index()
#print(df)

dfs = []
#select all columns without first index column
for col in df.columns[1:]:
    dfs.append(df.groupby(df[col].isnull().diff().cumsum().fillna(0)).mean().set_index('index')[[col]])
df = pd.concat(dfs, axis=1)

#drop rows with all NaN
df = df.dropna(how='all').rename_axis(None)  
#convert unix time to datetime
df.index = pd.to_datetime(df.index, unit='s')
print(df)
                           a        b        c
2013-05-16 17:33:45  482.455      NaN      NaN
2013-05-16 17:35:45      NaN  431.645      NaN
2013-05-16 17:37:45      NaN      NaN  408.765
2013-05-16 17:49:30  432.230      NaN      NaN
2013-05-16 17:51:00      NaN  424.080      NaN
2013-05-16 17:52:30      NaN      NaN  411.670
2013-05-16 18:01:45  472.560      NaN      NaN
2013-05-16 18:03:45      NaN  439.790      NaN
2013-05-16 18:17:45  465.195      NaN      NaN
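
As an aside (not part of the original answer), the per-column loop can be avoided entirely. This sketch starts again from the original datetime-indexed frame and relies on the note in the question that every row holds exactly one value, so stack() yields the values in time order together with their source column, and a new run starts whenever the source column changes:

import numpy as np
import pandas as pd

#one (timestamp, column) pair per original row, NaNs dropped, time order kept
s = df.stack()
flat = pd.DataFrame({
    'time': s.index.get_level_values(0).astype(np.int64) // 10**9,
    'col': s.index.get_level_values(1),
    'val': s.values,
})
#new group whenever the source column changes between consecutive values
flat['grp'] = flat['col'].ne(flat['col'].shift()).cumsum()

#average Unix time and value per run, then spread the columns back out
out = flat.groupby(['grp', 'col'], sort=False)[['time', 'val']].mean().reset_index()
out['time'] = pd.to_datetime(out['time'], unit='s')
res = out.pivot(index='time', columns='col', values='val') \
         .rename_axis(None).rename_axis(None, axis=1)
print(res)   #reproduces the result above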
jezrael
  • I wonder if there's a way to avoid having to manually set each column and make it a little more automatic. Great answer, though. – TomCho Mar 29 '16 at 01:02
  • Answer was edited, please check it. Sorry for delay. – jezrael Mar 29 '16 at 12:48
  • Thanks, your edit was exactly what I did when I implemented your answer 40 min ago :) I was referring more to a way of doing it without having to loop through the columns (sorry, I wasn't very clear). – TomCho Mar 29 '16 at 13:21
  • I think the approach of appending to a list `dfs` of `DataFrames` works nicely, since the list feeds straight into the `concat` function. – jezrael Mar 29 '16 at 13:30