
I have a pandas DataFrame that contains 38 time series. Each time series starts at 0 s and finishes at 1 s, and a numerical column with the time in seconds (between 0 and 1) is the only hint I have about where each time series starts and ends.

I would like to split the original df into 38 data frames.

I guess I could simply loop over the rows, checking the time value until it reaches 1 s, and then split, but I was wondering if there's a smarter and faster way of doing this? At some point I will have 38,000 time series...

The dataframe would look something like:

rows    ¦ var1 ¦ var2 ¦ ... ¦ time
row 1   ¦ x1   ¦ y1   ¦ ... ¦ 0.0
...
row 100 ¦ x100 ¦ y100 ¦ ... ¦ 1.0
row 101 ¦ x101 ¦ y101 ¦ ... ¦ 0.0
...

I would like to split the df at row 100, so that row 101 becomes the first row of a new, separate dataframe, and then repeat this procedure 38 times within a given df.
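The naive loop I have in mind would look roughly like this (just a sketch, assuming the time column is called `time`):

dfs = []
start = 0
for i, t in enumerate(df["time"]):
    if t == 1:                        # end of the current time series
        dfs.append(df.iloc[start:i + 1])
        start = i + 1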

My question is not the same as [1], because in that case the person wants to group rows that share the same date value, whereas I don't want to group by equal values at all.

[1] [Splitting dataframe into multiple dataframes](https://stackoverflow.com/questions/19790790/splitting-dataframe-into-multiple-dataframes)

jotNewie
  • Possible duplicate of [Splitting dataframe into multiple dataframes](https://stackoverflow.com/questions/19790790/splitting-dataframe-into-multiple-dataframes) – abolotnov Sep 06 '19 at 14:41
  • try `df.groupby(df.time.eq(1).shift().fillna(0).cumsum())` – ALollz Sep 06 '19 at 14:42
  • @abolotnov I would say it's not a duplicate, because that person wants to group by same date. I don't want to group by same values of a column. – jotNewie Sep 06 '19 at 14:43

1 Answer


Thanks for the hint, @ALollz:

group = []
# the cumsum chain labels every row with a running group id that
# increments on the row after each time == 1
for k, g in df.groupby(df.time.eq(1).shift().fillna(0).cumsum()):
    group.append(g)

You can then access the group you need with `group[0]`, `group[1]`, `group[2]`, and so on.
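A minimal self-contained version, to make the whole flow reproducible (the toy data below is made up for illustration; only the `time` column name is taken from the question):

import pandas as pd

# three series of unequal length, each ending where time reaches 1
df = pd.DataFrame({
    "var1": range(12),
    "time": [0, 0.5, 1, 0, 0.25, 0.5, 0.75, 1, 0, 0.3, 0.6, 1],
})

key = df.time.eq(1).shift().fillna(0).cumsum()
group = [g for _, g in df.groupby(key)]

print(len(group))   # 3
print(group[0])     # rows 0-2: the first series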

Details

Starting dataframe

row     var1    var2    time
row1    x1       y1     0
row2    x2       y2     0
row3    x3       y3     0
row4    x4       y4     0
row5    x5       y5     0
row6    x6       y6     0
row7    x7       y7     0
row8    x8       y8     1
row9    x9       y9     0
row10   x10     y10     0
row11   x11     y11     0
row12   x12     y12     0
row13   x13     y13     0
row14   x14     y14     1
row15   x15     y15     0
row16   x16     y16     0
row17   x17     y17     0
row18   x18     y18     0

With `df.time.eq(1).shift().fillna(0).cumsum()` we are essentially creating a key column to group by, shown here as column s:

row     var1    var2    time    s
row1    x1       y1     0       0
row2    x2       y2     0       0
row3    x3       y3     0       0
row4    x4       y4     0       0
row5    x5       y5     0       0
row6    x6       y6     0       0
row7    x7       y7     0       0
row8    x8       y8     1       0
row9    x9       y9     0       1
row10   x10     y10     0       1
row11   x11     y11     0       1
row12   x12     y12     0       1
row13   x13     y13     0       1
row14   x14     y14     1       1
row15   x15     y15     0       2
row16   x16     y16     0       2
row17   x17     y17     0       2
row18   x18     y18     0       2

Then we group by column s (even though we never actually add s to the dataframe; the key is passed to groupby as a Series). Since each group is itself a dataframe, you end up with the separate dataframes you wanted.
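Broken down step by step (the intermediate names are just for illustration):

is_end = df.time.eq(1)      # True on the closing row of each series
shifted = is_end.shift()    # push the flag down one row, so a new group
                            # starts on the row *after* the 1, not on it
filled = shifted.fillna(0)  # the first row has no predecessor; treat it as 0
s = filled.cumsum()         # running count of completed series = group id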

If we use df.time.eq(1).fillna(0).cumsum() instead (without the shift; the fillna(0) is actually redundant here, since eq never produces NaN), the row where time hits 1 lands in the next dataframe rather than closing the current one. The grouping key is shown as column s2:

    row     var1    var2 time   s   s2
0   row1    x1       y1     0   0   0
1   row2    x2       y2     0   0   0
2   row3    x3       y3     0   0   0
3   row4    x4       y4     0   0   0
4   row5    x5       y5     0   0   0
5   row6    x6       y6     0   0   0
6   row7    x7       y7     0   0   0
7   row8    x8       y8     1   0   1
8   row9    x9       y9     0   1   1
9   row10   x10     y10     0   1   1
10  row11   x11     y11     0   1   1
11  row12   x12     y12     0   1   1
12  row13   x13     y13     0   1   1
13  row14   x14     y14     1   1   2
14  row15   x15     y15     0   2   2
15  row16   x16     y16     0   2   2
16  row17   x17     y17     0   2   2
17  row18   x18     y18     0   2   2
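As ALollz suggests in the comments below, if the time values really do increase within each series and every series starts at exactly 0, an even simpler key works (note this would not split the simplified 0/1 demo data above correctly, since several consecutive rows there have time equal to 0):

# every exact 0 marks the start of a new series
group = [g for _, g in df.groupby(df.time.eq(0).cumsum())]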
moys
  • I am not sure this is actually working - it does not seem to store all the data between the second 1 and 0. It does work for the first data frames, but then at some point it starts mixing up rows, I am not sure why... – jotNewie Sep 06 '19 at 15:11
  • Could you explain a bit more how it is supposed to work? That should maybe help understanding what might be going wrong :) – jotNewie Sep 06 '19 at 15:12
  • @jotNewie can you provide a few more rows of your data? This *should* work if your data are well organized `[0, 0.2, 0.4, 0.6, 0.8, 1, 0, ...]` (spacing is irrelevant) but there are several ways this could break. Also just realized it might be simpler to group with `df.time.eq(0).cumsum()`. Perhaps the ends aren't truly 1, but the beginning is always truly 0? – ALollz Sep 06 '19 at 15:15
  • @ALollz you are right, some of my original rows were already disordered within the original file, and this is what confused me. But the code works perfectly fine – jotNewie Sep 06 '19 at 16:05