22

I'm looking for a pandas equivalent of the resample method for a dataframe whose isn't a DatetimeIndex but an array of integers, or maybe even floats.

I know that for some cases (this one, for example) the resample method can be substituted easily by a reindex and interpolation, but for some cases (I think) it can't.

For example, if I have

df = pd.DataFrame(np.random.randn(10,2))
withdates = df.set_index(pd.date_range('2012-01-01', periods=10))
withdates.resample('5D', np.std)

this gives me

                   0         1
2012-01-01  1.184582  0.492113
2012-01-06  0.533134  0.982562

but I can't produce the same result with df and resample. So I'm looking for something that would work as

 df.resample(5, np.std)

and that would give me

          0         1
0  1.184582  0.492113
5  0.533134  0.982562

Does such a method exist? The only way I was able to create this method was by manually separating df into smaller dataframes, applying np.std and then concatenating everything back, which I find pretty slow and not smart at all.

Cheers

Community
  • 1
  • 1
TomCho
  • 3,204
  • 6
  • 32
  • 83

3 Answers3

16

Setup

import pandas as pd
import numpy as np

np.random.seed([3,1415])
df = pd.DataFrame(np.random.rand(20, 2), columns=['A', 'B'])

You need to create the labels to group by yourself. I'd use:

(df.index.to_series() / 5).astype(int)

To get you a series of values like [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, ...] Then use this in a groupby

You'll also need to specify the index for the new dataframe. I'd use:

df.index[4::5]

To get a the current index starting at the 5th position (hence the 4) and every 5th position after that. It will look like [4, 9, 14, 19]. I could've done this as df.index[::5] to get the starting positions but I went with ending positions.

Solution

# assign as variable because I'm going to use it more than once.
s = (df.index.to_series() / 5).astype(int)

df.groupby(s).std().set_index(s.index[4::5])

Looks like:

           A         B
4   0.198019  0.320451
9   0.329750  0.408232
14  0.293297  0.223991
19  0.095633  0.376390

Other considerations

This is for the equivalent of down sampling. We haven't addressed up sampling.

To go back from what we've produced to a dataframe index by something more frequent, we can use reindex like so:

# assign what we've done above to df_down
df_down = df.groupby(s).std().set_index(s.index[4::5])

df_up = df_down.reindex(range(20)).bfill()

Looks like:

           A         B
0   0.198019  0.320451
1   0.198019  0.320451
2   0.198019  0.320451
3   0.198019  0.320451
4   0.198019  0.320451
5   0.329750  0.408232
6   0.329750  0.408232
7   0.329750  0.408232
8   0.329750  0.408232
9   0.329750  0.408232
10  0.293297  0.223991
11  0.293297  0.223991
12  0.293297  0.223991
13  0.293297  0.223991
14  0.293297  0.223991
15  0.095633  0.376390
16  0.095633  0.376390
17  0.095633  0.376390
18  0.095633  0.376390
19  0.095633  0.376390

We could also use other things to reindex by like range(0, 20, 2) to up sample to even integer indices.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Nice solution. I wasn't worrying much about the upsampling, but now that you did, I was wondering about the reindexing. Maybe it would be better to reindex it with floats (something like `np.linspace(start, end, number_of_indexes)`). Is this worth a github issue as an improvement? – TomCho May 24 '16 at 00:07
  • I just tried it on floats... Not good. It would be interesting to see this formalized. More importantly, did I interpret your question correctly? And did I successfully answer it? – piRSquared May 24 '16 at 00:30
  • Yes, don't worry, I'll accept your answer in a second. You actually inspired me to come up with my own answer which I'll post it just for sake of the community. It's longer, but it may be more robust. I'll post it in a sec. Your thoughts on it will be appreciated. – TomCho May 24 '16 at 00:32
  • This works great for example data. But if I run it on my real data, at this point `s.index[4::5]` I get in trouble. I can't start at an arbitrary point, only a few and then I get a `ValueError: Lengths mismatch.` I don't understand why. I want to subsample data to every 100th. s.index[13::100] works fine, but s.index[14::100] and any higher throws the error. Where is my dimension lost? – durbachit Dec 15 '16 at 02:38
3

Alternative, this is one thing that can be done

def resample(df, rule, how=None, **kwargs):
    import pandas as pd
    if how==None:
        import numpy as np
        how = np.mean

    if isinstance(df.index, pd.DatetimeIndex) and isinstance(rule, str):
        return df.resample(rule, how, **kwargs)
    else:
        idx, bins = pd.cut(df.index, range(df.index[0], df.index[-1]+2, rule), right=False, retbins=True)
        aux = df.groupby(idx).apply(how)
        aux = aux.set_index(bins[:-1])
        return aux
TomCho
  • 3,204
  • 6
  • 32
  • 83
  • looks good but I think idx and bins are swapped here (i.e. the output of pd.cut is bins, idx, and then groupby should be by bins, and set_index with index) – MrE Jun 19 '18 at 01:50
1

@piSquared solution is really nice, but I don't like picking index per hand at reindexing.

This should works too for each kind of downsampling (float index too) and automatically pick the mean of the index in each range:

df = pd.DataFrame(index = np.random.rand(20)*30, data=np.random.rand(20, 2), columns=['A', 'B'])
df.index.name = 'crazy_index'

s = (df.index.to_series() / 10).astype(int)

Now you can pick the function you want to calculate in each sub group at your will:

# calculate std() in each group
df.groupby(s).mean().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )

                    A         B
crazy_index
3.667539     0.276986  0.317642
14.275074    0.248700  0.372551
25.054042    0.254860  0.297586

# calculate median() in each group
df.groupby(s).median().set_index( s.groupby(s).apply(lambda x: np.mean(x.index)) )
Out[38]:
                    A         B
crazy_index
3.667539     0.454654  0.521649
14.275074    0.451265  0.490125
25.054042    0.489326  0.622781

EDIT : There were some errors in s indexing, now it is correct & working.

kidpixo
  • 516
  • 7
  • 13