Slicing pandas dataframe by ordered values into clusters

Question

I have a pandas dataframe like the one below. There are long gaps in time, and I want to slice it into smaller dataframes so that each one holds a "cluster" of points that are close together in time.

           Time  Value
0   56610.41341  8.55
1   56587.56394  5.27
2   56590.62965  6.81
3   56598.63790  5.47
4   56606.52203  6.71
5   56980.44206  4.75
6   56592.53327  6.53
7   57335.52837  0.74
8   56942.59094  6.96
9   56921.63669  9.16
10  56599.52053  6.14
11  56605.50235  5.20
12  57343.63828  3.12
13  57337.51641  3.17
14  56593.60374  5.69
15  56882.61571  9.50

I tried sorting it and taking the time difference between consecutive points with

df = df.sort_values("Time")
df['t_dif'] = df['Time'] - df['Time'].shift(-1)

And it gives

           Time  Value     t_dif
1   56587.56394  5.27   -3.06571
2   56590.62965  6.81   -1.90362
6   56592.53327  6.53   -1.07047
14  56593.60374  5.69   -5.03416
3   56598.63790  5.47   -0.88263
10  56599.52053  6.14   -5.98182
11  56605.50235  5.20   -1.01968
4   56606.52203  6.71   -3.89138
0   56610.41341  8.55 -272.20230
15  56882.61571  9.50  -39.02098
9   56921.63669  9.16  -20.95425
8   56942.59094  6.96  -37.85112
5   56980.44206  4.75 -355.08631
7   57335.52837  0.74   -1.98804
13  57337.51641  3.17   -6.12187
12  57343.63828  3.12        NaN

Let's say I want to slice this dataframe into smaller dataframes in which the time difference between two consecutive points is smaller than 40. How would I go about doing this?

I could loop over the rows, but this is frowned upon. Is there a smarter solution?

Edit: Here is an example:

df1:

           Time  Value     t_dif
1   56587.56394  5.27   -3.06571
2   56590.62965  6.81   -1.90362
6   56592.53327  6.53   -1.07047
14  56593.60374  5.69   -5.03416
3   56598.63790  5.47   -0.88263
10  56599.52053  6.14   -5.98182
11  56605.50235  5.20   -1.01968
4   56606.52203  6.71   -3.89138

df2:

0   56610.41341  8.55 -272.20230

df3:

15  56882.61571  9.50  -39.02098
9   56921.63669  9.16  -20.95425
8   56942.59094  6.96  -37.85112

... etc.

score 0 · Answer 1 · answered Jul 20 '21 at 12:01

I think you can just filter on t_dif:

df1 = df[df['t_dif']<30]

df2 = df[df['t_dif']>=30]

Ulewsky


This doesn't work. It just puts the times that have points close to them into one dataframe and the rest into the other one. – MesRot Jul 20 '21 at 12:05
I do not understand. I thought you wanted the times where the diff is smaller than 30. Can you provide an example of the output that you want? – Ulewsky Jul 20 '21 at 12:31
Added an example of the wanted output – MesRot Jul 20 '21 at 13:40
So to get this output you can write something like this: ```df1 = df[df['t_dif']>-10]``` ```df2 = df[(df['t_dif']<-10)&(df['t_dif']>-40)]``` ```df3 = df[df['t_dif']<-100]``` etc. – Ulewsky Jul 20 '21 at 14:04
And you can always adjust the ranges; they do not have to be fixed numbers. You can use quantiles for this, which you can read about [here](https://stackoverflow.com/questions/65613620/pandas-how-assign-to-a-new-df-values-in-quantiles-using-greater-than-and-small). – Ulewsky Jul 20 '21 at 14:09
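
To illustrate the objection in the first comment, here is a small sketch using the times from the question. The names `small` and `large` are made up for illustration. Since `t_dif` as computed in the question is negative everywhere, the sketch takes absolute values first. A boolean mask on the gap size sorts rows into buckets by how big their gap is; it does not keep neighbouring rows together.

```python
import pandas as pd

# Times copied from the question, in the original row order.
times = [
    56610.41341, 56587.56394, 56590.62965, 56598.63790,
    56606.52203, 56980.44206, 56592.53327, 57335.52837,
    56942.59094, 56921.63669, 56599.52053, 56605.50235,
    57343.63828, 57337.51641, 56593.60374, 56882.61571,
]
df = pd.DataFrame({"Time": times}).sort_values("Time")

# Absolute gap between each row and the next one (NaN for the last row).
df["t_dif"] = (df["Time"] - df["Time"].shift(-1)).abs()

small = df[df["t_dif"] < 30]
large = df[df["t_dif"] >= 30]

# Rows 0 and 15 are more than 270 apart in time, yet both land in `large`.
print(large.index.tolist())  # [0, 15, 8, 5]
```

So the mask mixes rows from different time ranges, which is why it cannot produce the clusters asked for.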

MesRot · Accepted Answer · 2021-07-21T10:37:35.557

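def split_dataframe(df, value):
    # Sort by time and renumber the rows 0..n-1 (the old index is kept as a column).
    df = df.sort_values("Time")
    df = df.reset_index()
    # Gap between each row and the next one; NaN for the last row.
    df['t_dif'] = (df['Time'] - df['Time'].shift(-1)).abs()
    # Positions of the rows that are followed by a gap larger than value,
    # i.e. the last row of every cluster except the final one.
    indxs = df.index[df['t_dif'] > value].tolist()

    # Sentinels: "just before the first row" and "the last row".
    indxs.append(-1)
    indxs.append(len(df) - 1)

    indxs.sort()
    frames = []

    for i in range(1, len(indxs)):
        # From the row after the previous cluster end up to and including this cluster end.
        val = df.iloc[indxs[i - 1] + 1: indxs[i] + 1]
        frames.append(val)

    return frames

This returns the clusters as a list of dataframes, one per cluster.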
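
Since the question asked for a way to avoid looping, a loop-free variant is also possible with `diff`, `cumsum` and `groupby`. This is only a sketch, not part of the original answer: the function name `split_by_gap` and its `col` parameter are made up for illustration, and the sample frame is the data from the question.

```python
import pandas as pd


def split_by_gap(df, max_gap, col="Time"):
    """Split df into a list of frames wherever consecutive sorted
    values of `col` are more than `max_gap` apart (hypothetical helper)."""
    ordered = df.sort_values(col)
    # True on the first row of each new cluster: the gap to the previous
    # row exceeds max_gap. The very first row compares NaN > max_gap, i.e. False.
    starts_new = ordered[col].diff() > max_gap
    # A running count of cluster starts labels every row with a cluster id 0, 1, 2, ...
    cluster_id = starts_new.cumsum()
    return [group for _, group in ordered.groupby(cluster_id)]


df = pd.DataFrame({
    "Time": [56610.41341, 56587.56394, 56590.62965, 56598.63790,
             56606.52203, 56980.44206, 56592.53327, 57335.52837,
             56942.59094, 56921.63669, 56599.52053, 56605.50235,
             57343.63828, 57337.51641, 56593.60374, 56882.61571],
    "Value": [8.55, 5.27, 6.81, 5.47, 6.71, 4.75, 6.53, 0.74,
              6.96, 9.16, 6.14, 5.20, 3.12, 3.17, 5.69, 9.50],
})

frames = split_by_gap(df, 40)
print([len(f) for f in frames])  # [9, 4, 3]
```

With a threshold of 40, the sample data falls into three clusters, because only the gaps of about 272 and 355 exceed it. Unlike the function above, this variant keeps the original row labels instead of resetting the index.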
