resample and aggregate using multiple named aggregation functions on multiple columns

Question

I have a dataframe like

import pandas as pd
import numpy as np
idx = pd.date_range('2015-01-01', '2015-01-05', freq='15min')  # avoid shadowing the builtin range
df = pd.DataFrame(index=idx)
df['speed'] = np.random.randint(low=0, high=60, size=len(df.index))
df['otherF'] = np.random.randint(low=2, high=42, size=len(df.index))

I can easily resample and apply a built-in such as sum():

df['speed'].resample('1D').sum()
Out[121]: 
2015-01-01    2865
2015-01-02    2923
2015-01-03    2947
2015-01-04    2751

I can also apply a custom function returning multiple values:

def mu_cis(x):
     x_=x[~np.isnan(x)]
     CI=np.std(x_)/np.sqrt(x.shape)
     return np.mean(x_),np.mean(x_)-CI,np.mean(x_)+CI,len(x_)

df['speed'].resample('1D').agg(mu_cis)
Out[122]: 
2015-01-01     (29.84375, [28.1098628611], [31.5776371389], 96)
2015-01-02    (30.4479166667, [28.7806726396], [32.115160693...
2015-01-03    (30.6979166667, [29.0182072972], [32.377626036...
2015-01-04       (28.65625, [26.965228204], [30.347271796], 96)

As I have read here (pandas apply function that returns multiple values to rows in pandas dataframe), I can even return multiple named values:

def myfunc1(x):
    x_=x[~np.isnan(x)]
    CI=np.std(x_)/np.sqrt(x.shape)
    e=np.mean(x_) 
    f=np.mean(x_)+CI
    g=np.mean(x_)-CI
    return pd.Series([e,f,g], index=['MU', 'MU+', 'MU-'])

df['speed'].resample('1D').agg(myfunc1)

which gives

Out[124]: 
2015-01-01  MU             29.8438
        MU+    [31.5776371389]
        MU-    [28.1098628611]
2015-01-02  MU             30.4479
        MU+    [32.1151606937]
        MU-    [28.7806726396]
2015-01-03  MU             30.6979
        MU+    [32.3776260361]
        MU-    [29.0182072972]
2015-01-04  MU             28.6562
        MU+     [30.347271796]
        MU-     [26.965228204]

However, when I try to apply this to all the original columns at once, I only get NaNs:

df.resample('1D').agg(myfunc1)
Out[127]: 
        speed  otherF
2015-01-01    NaN     NaN
2015-01-02    NaN     NaN
2015-01-03    NaN     NaN
2015-01-04    NaN     NaN
2015-01-05    NaN     NaN

The result is the same whether I call agg or apply after resample().

What is the right way to do this?

Uvar · Accepted Answer · 2017-10-02T14:45:15.353

The problem is in myfunc1: it always returns a pd.Series, even when it is handed a pd.DataFrame. The following seems to work just fine.

def myfunc1(x):
    x_=x[~np.isnan(x)]
    CI=np.std(x_)/np.sqrt(x.shape[0]) #x.shape is a tuple, so take the row count; np.sqrt(x.shape) yields an array
    e=np.mean(x_)
    f=np.mean(x_)+CI
    g=np.mean(x_)-CI
    try:
        return pd.DataFrame([e,f,g], index=['MU', 'MU+', 'MU-'], columns = x.columns)
    except AttributeError: #will still raise errors of other nature
        return pd.Series([e,f,g], index=['MU', 'MU+', 'MU-'])

Alternatively:

def myfunc1(x):
    x_=x[~np.isnan(x)]
    CI=np.std(x_)/np.sqrt(x.shape[0]) #x.shape is a tuple, so take the row count; np.sqrt(x.shape) yields an array
    e=np.mean(x_)
    f=np.mean(x_)+CI
    g=np.mean(x_)-CI
    if x.ndim > 1: #Equivalent to if len(x.shape) > 1
        return pd.DataFrame([e,f,g], index=['MU', 'MU+', 'MU-'], columns = x.columns)
    return pd.Series([e,f,g], index=['MU', 'MU+', 'MU-'])
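
A different route, sketched here as an assumption rather than as part of the answer above, is to avoid returning a Series or DataFrame from the aggregation function at all. Passing a list of scalar-returning functions to agg applies each one to every column and labels the result with a (column, function name) MultiIndex. The helper names mu, mu_plus, mu_minus and _ci are hypothetical, the data is seeded instead of random, and the standard deviation uses ddof=0 to mirror np.std.

```python
import numpy as np
import pandas as pd

# Seeded stand-in for the question's random data (assumption: any numeric data works).
rng = np.random.default_rng(0)
idx = pd.date_range("2015-01-01", "2015-01-05", freq="15min")
df = pd.DataFrame(
    {
        "speed": rng.integers(0, 60, size=len(idx)),
        "otherF": rng.integers(2, 42, size=len(idx)),
    },
    index=idx,
)


def _ci(x):
    """Half-width of the interval: population std over sqrt(n), NaNs dropped."""
    x_ = x.dropna()
    return x_.std(ddof=0) / np.sqrt(len(x_))


def mu(x):
    """Mean of the non-NaN values in one resampled bin."""
    return x.dropna().mean()


def mu_plus(x):
    return mu(x) + _ci(x)


def mu_minus(x):
    return mu(x) - _ci(x)


# Each function receives one column of one daily bin as a Series and returns a scalar.
# The function names become the second level of the resulting column MultiIndex.
stats = df.resample("1D").agg([mu, mu_plus, mu_minus])
print(stats)
```

A single statistic can then be selected with a tuple, for example stats[("speed", "mu_plus")], or one column's block with stats["speed"].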

edited Oct 02 '17 at 14:45

answered Oct 02 '17 at 14:21


Great! Do you know if there is any way to check whether a DataFrame or a Series needs to be returned based on input size? – 00__00__00 Oct 02 '17 at 14:27
I am asking this because I would avoid a try block down there, as I expect exceptions for other reasons – 00__00__00 Oct 02 '17 at 14:28
Do they all raise AttributeErrors (`except AttributeError`)? I suppose you can also do it based on something like `if len(x.shape) > 1: return pd.DataFrame ..... ; else: return pd.Series.......` – Uvar Oct 02 '17 at 14:32
Updated the answer to match previous comment – Uvar Oct 02 '17 at 14:37
