0

Lets say I have a pandas.DataFrame that has hourly data for 3 days:

import pandas as pd
import numpy as np
import datetime as dt
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))

I would like to get every, let's say, 6 hours of data and independently fit a curve to that data. Since pandas' resample function has a how keyword that is supposed to be any numpy array function, I thought that I could maybe try to use resample to do that with polyfit, but apparently there is no way (right?).

So the only alternative way I thought of doing that is separating df into a sequence of DataFrames, so I am trying to create a function that would work such as

l=splitDF(df, '6H')

and it would return to me a list of dataframes, each one with 6 hours of data (except maybe the first and last ones). So far I got nothing that could work except something like the following manual method:

def splitDF(data, rule):
        res_index=data.resample(rule).index
        out=[]
        cont=0
        for date in data.index:
              ... check for date in res_index ...
              ... and start cutting at those points ...

But this method would be extremely slow and there is probably a faster way to do it. Is there a fast (maybe even pythonic) way of doing this?

Thank you!

EDIT

A better method (that needs some improvement but it's faster) would be the following:

def splitDF(data, rule):
    res_index=data.resample(rule).index
    out=[]
    pdate=res_index[0]
    for date in res_index:
            out.append(data[pdate:date][:-1])
            pdate=date
    out.append(data[pdate:])
    return out

But still seems to me that there should be a better method.

Community
  • 1
  • 1
TomCho
  • 3,204
  • 6
  • 32
  • 83

2 Answers2

1

Ok, so this sounds like a textbook case for using groupby. Here's my thinking:

import pandas as pd

#let's define a function that'll group a datetime-indexed dataframe by hour-interval/date
def create_date_hour_groups(df, hr):
     new_df = df.copy()
     hr_int = int(hr)
     new_df['hr_group'] = new_df.index.hour/hr_int
     new_df['dt_group'] = new_df.index.date
     return new_df

#now we define a wrapper for polyfit to pass to groupby.apply
def polyfit_x_y(df, x_col='A', y_col='B', poly_deg=3):
    df_new = df.copy()
    coef_array = pd.np.polyfit(df_new[x_col], df_new[y_col], poly_deg)
    poly_func = pd.np.poly1d(coef_array)
    df_new['poly_fit'] = poly_func(df[x_col])
    return df_new

#to the actual stuff
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(pd.np.random.randn(3*24,2),index=dates,columns=list('AB'))
df = create_date_hour_groups(df, 6)
df_fit = df.groupby(['dt_group', 'hr_group'],
                    as_index=False).apply(polyfit_x_y)
user2734178
  • 227
  • 1
  • 9
  • That is very good thinking. I can't implement it right now (I'll do it once I'm my computer) but I see the thinking. Only issue is that if the names if the columns change, I'll have to change `polyfit_x_y` as well, right? Uh, other thing is that I was actually wanting to fit the curve based on the datetime index. But that can be easily fixed. – TomCho Aug 21 '15 at 00:48
0

How about?

np.array_split(df,len(df)/6)
JAB
  • 12,401
  • 6
  • 45
  • 50
  • 1
    This was a suggestion for a solution. How about? was asking the OP if the solution gets him what he needs. I am surprised that this needed explaining. – JAB Aug 20 '15 at 21:56
  • I tried that but turns out when the index is a date this splits the df unevenly. – TomCho Aug 21 '15 at 19:43