
I have a Pandas DataFrame named df with a column named 'step', which is just an incremental counter (1, 2, 3, 4, etc.):

step col1 col2
1    2    3
2    3    5
3    1    0
4    8    9
5    2    3

I'm selecting from df some rows of interest:

    work_df = df[df['col1'] < df['col2']]

step col1 col2
1    2    3
2    3    5
4    8    9
5    2    3

Now I need to split work_df into sub_df's by continuity of 'step' (i.e. if work_df['step'] == [1,2,3,7,8,9], then [1,2,3] belongs to sub_df_1 while [7,8,9] belongs to sub_df_2, etc.). Currently I'm doing it this way:

    prev_step = None
    step_count = 0
    for idx, row in work_df.iterrows():
        if prev_step is not None and row['step'] > prev_step + 1:
            if step_count > 1:  # don't want to have a df with only 1 row
                interval_list.append({'step_count': step_count ... })
            step_count = 1  # the current row starts a new run
        else:
            step_count += 1
        prev_step = row['step']

I'm building new sub_df's then based on information from interval_list. But I'm not sure this is the best way to achieve what I really need:

sub_df1=
step col1 col2
1    2    3
2    3    5

sub_df2=
step col1 col2
4    8    9
5    2    3
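For reference, a self-contained sketch of this loop-based approach (the start/end bookkeeping, the `(start, end)` tuples in `interval_list`, and the `between` filter are assumptions to make it runnable, not part of the snippet above):

```python
import pandas as pd

work_df = pd.DataFrame({'step': [1, 2, 4, 5],
                        'col1': [2, 3, 8, 2],
                        'col2': [3, 5, 9, 3]})

# Collect (start_step, end_step) intervals of consecutive runs of 'step'.
interval_list = []
prev_step = None
start_step = None
for _, row in work_df.iterrows():
    step = row['step']
    if prev_step is None or step > prev_step + 1:
        # A run just ended; keep it only if it has more than 1 row.
        if prev_step is not None and prev_step > start_step:
            interval_list.append((start_step, prev_step))
        start_step = step
    prev_step = step
# Flush the final run after the loop.
if prev_step is not None and prev_step > start_step:
    interval_list.append((start_step, prev_step))

# Build the sub-DataFrames from the collected intervals.
sub_dfs = [work_df[work_df['step'].between(lo, hi)] for lo, hi in interval_list]
```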

Are there better ways to split DataFrame by continuity of a column?

chernish2

1 Answer


You can group here by `(df[col] != df[col].shift(1) + 1).cumsum()`, or, as @MarkWang suggests, by `df['data'].diff().ne(1).cumsum()`. Indeed, if we use your sample data:

>>> df
   data
0     1
1     2
2     3
3     7
4     8
5     9

then this expression will yield:

>>> df['data'].diff().ne(1).cumsum()
0    1
1    1
2    1
3    2
4    2
5    2
Name: data, dtype: int64

So we can here perform a groupby on these values:

>>> list(df.groupby(df['data'].diff().ne(1).cumsum()))
[(1,    data
0     1
1     2
2     3), (2,    data
3     7
4     8
5     9)]

We thus have two groups: [1,2,3] and [7,8,9], and you can obtain them with:

>>> (__, sub_df_1), (__, sub_df_2) = df.groupby(df['data'].diff().ne(1).cumsum())
>>> sub_df_1
   data
0     1
1     2
2     3
>>> sub_df_2
   data
3     7
4     8
5     9

or for the second sample data fragment you provided:

>>> (__, sub_df1), (__, sub_df2) = df2.groupby(df2['step'].diff().ne(1).cumsum())
>>> sub_df1
   step  col1  col2
0     1     2     3
1     2     3     5
>>> sub_df2
   step  col1  col2
2     4     8     9
3     5     2     3

You can, for example, use a list comprehension or an `itemgetter` to obtain the corresponding groups:

>>> [g[1] for g in df2.groupby(df2['step'].diff().ne(1).cumsum())]
[   step  col1  col2
0     1     2     3
1     2     3     5,    step  col1  col2
2     4     8     9
3     5     2     3]

You can drop groups of length 1 as well:

>>> [g[1] for g in df2.groupby(df2['step'].diff().ne(1).cumsum()) if len(g[1]) > 1]
[   step  col1  col2
0     1     2     3
1     2     3     5,    step  col1  col2
2     4     8     9
3     5     2     3]

This list thus contains the two subgroups. I strongly advise against assigning these to variables by name dynamically. Yes, it is strictly speaking possible, but it is an anti-pattern, and it usually does more harm than good.
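If you want to access the groups by key rather than by position in a list, a simple alternative to name-based assignment is to collect them in a dictionary keyed by the group number (a sketch, using the same sample data as above):

```python
import pandas as pd

df2 = pd.DataFrame({'step': [1, 2, 4, 5],
                    'col1': [2, 3, 8, 2],
                    'col2': [3, 5, 9, 3]})

# Key each sub-DataFrame by its run number instead of inventing variable names.
groups = {k: g for k, g in df2.groupby(df2['step'].diff().ne(1).cumsum())}
```

`groups[1]` and `groups[2]` are then the two sub-DataFrames, and the dict works for any number of runs.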

Willem Van Onsem
    `cumsum` over boolean is always a nice way to create groups:) – Mark Wang Aug 17 '19 at 21:23
    @MarkWang: yes, I was quite amused when I first discovered this trick :) – Willem Van Onsem Aug 17 '19 at 21:23
    Though I would write something like `df['data'].diff().ne(1)` to generate boolean – Mark Wang Aug 17 '19 at 21:27
  • Thank you very much but it's a bit complicated for Pandas newbie like me. As a fact I could get any number of sub_df's, so I guess there should be something like a loop somewhere? – chernish2 Aug 17 '19 at 21:30
    @user3523475: you can use `from operator import itemgetter`, and then `list(map(itemgetter(1), df2.groupby(df2['step'].diff().ne(1).cumsum())))` to obtain a list of all your subgroups. – Willem Van Onsem Aug 17 '19 at 21:31