I'm working on a dataframe where it's important to keep the order. I would like to split it into chunks that I process afterwards.

The split is based on the third column, type: all contiguous records with the same value in type (or in any given categorical column) should end up in one chunk. If possible, I want to do it in a pythonic way.

However, I can only think of solutions that iterate through the df. This will have to work on dataframes with roughly tens of thousands of entries, and I have no idea what the fastest strategy is. Here is a small example of what I have:

     value_1   value_2 type
0  -0.005842 -0.494596    a
1   0.697689  0.354717    a
2  -0.354206 -1.776550    a
3   2.154078  0.344629    a
4   1.072475  1.004945    a
5  -1.338075  0.175607    b
6  -1.913883 -0.123627    b
7  -0.021376 -0.170775    b
8  -0.274882 -0.043913    b
9   0.676371 -0.691243    b
10  0.440201 -0.577944    c
11 -0.689345 -0.445433    b
12  1.540386 -1.084499    c
13  0.236204 -0.072807    b
14 -0.257084  0.848501    c
15  0.681666 -0.265254    b
16 -1.168614 -0.359998    c
17  0.355938  1.529444    b
18  0.292976 -0.301847    c
19  0.670068  0.735191    b
20  0.551594 -0.074768    a
21 -1.251568 -0.022201    a
22  0.376663 -1.556191    a
23 -0.266714  0.860436    d
24 -0.871324  1.014529    d
25  1.504529 -0.657725    d

And here is how I would like to split it (the values differ from the example above; only the grouping by type matters):

    value_1   value_2 type
0  1.411723 -0.836490    a
1  0.482826  1.625925    a
2 -0.054475  2.046166    a
3  0.020816  0.155194    a
4  0.840539  0.287658    a
    value_1   value_2 type
5  0.257208 -2.311165    b
6 -1.545194 -0.193307    b
7  0.197849 -1.276644    b
8  0.074072 -0.172764    b
9 -2.562816  0.393645    b
     value_1   value_2 type
10  0.258265 -0.978293    c
     value_1  value_2 type
11 -0.804841 -0.78802    b
     value_1   value_2 type
12 -0.509034  1.116428    c
     value_1   value_2 type
13 -0.264252  1.025199    b
     value_1   value_2 type
14 -0.268105 -0.795613    c
     value_1   value_2 type
15  0.481051  0.184827    b
     value_1   value_2 type
16  1.242139  0.401806    c
     value_1   value_2 type
17  1.301684  0.281108    b
     value_1   value_2 type
18  0.189178  0.894425    c
     value_1   value_2 type
19 -0.093207  0.894564    b
     value_1   value_2 type
20 -2.231735  0.250696    a
21 -0.276050 -0.712792    a
22  0.298974 -0.529791    a
     value_1   value_2 type
23  0.115159  2.769695    d
24  0.636069 -1.066387    d
25  1.048230  1.500125    d

Something like a groupby that just returns a list of slices according to the value of the chosen column would be perfect. I haven't found any existing function like that.

TriGiamp
  • What is your criterion for splitting the dataframe? – user96564 Sep 14 '21 at 13:24
  • Same value in "type" column, but without putting together all the rows that satisfy the condition. I need to keep the chunks in order to make it easier to process after – TriGiamp Sep 14 '21 at 13:29

1 Answer

You can use a custom group and groupby to split your data:

df.groupby(df['type'].ne(df['type'].shift()).cumsum())
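
To see why this key works, here is a minimal sketch on a tiny made-up frame (the data and the names changed and run_id are illustrative assumptions, not part of the question). shift() moves the column down by one row, ne() marks every row whose value differs from the previous one, and cumsum() turns those marks into a counter that increases at the start of each new run:

```python
import pandas as pd

# Hypothetical toy data: only the "type" column matters here.
df = pd.DataFrame({"type": ["a", "a", "b", "b", "c", "b", "a", "a"]})

# True wherever the value differs from the row above (the first row is always True,
# because it is compared against the NaN introduced by shift()).
changed = df["type"].ne(df["type"].shift())

# Running count of changes: every contiguous run gets its own integer label.
run_id = changed.cumsum()

print(run_id.tolist())  # [1, 1, 2, 2, 3, 4, 5, 5]
```

Rows with the same type that are not adjacent (the two separate runs of "b" above) get different labels, which is what keeps the chunks in their original order.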

Then you can iterate over it:

groups = df.groupby(df['type'].ne(df['type'].shift()).cumsum())

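for group_id, group_df in groups:
    # group_df holds one contiguous run of rows that share the same type
    print(group_id, len(group_df))  # placeholder: replace with your own processing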
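
If you want a plain list of chunks, as asked in the question, the groupby can be wrapped in a small helper. This is only a sketch: the function name split_contiguous and the sample data are assumptions made for illustration.

```python
import pandas as pd

def split_contiguous(frame, column):
    """Return a list of sub-frames, one per contiguous run of equal values in `column`."""
    run_id = frame[column].ne(frame[column].shift()).cumsum()
    # run_id only ever increases, so the groups come out in the original row order.
    return [chunk for _, chunk in frame.groupby(run_id, sort=False)]

# Hypothetical sample data, much smaller than the frame in the question.
df = pd.DataFrame({
    "value_1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "type": ["a", "a", "b", "c", "b", "b"],
})

chunks = split_contiguous(df, "type")
for chunk in chunks:
    print(chunk["type"].iloc[0], chunk.index.tolist())
# a [0, 1]
# b [2]
# c [3]
# b [4, 5]
```

Each chunk keeps its original index, so it can be mapped back to the source frame. Note that missing values (NaN) never compare equal to each other, so each NaN row in the chosen column would end up in its own chunk.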
mozway