
I have a column with a list of values like so:

100
200
300
500
600
650
1000

I want to do a groupby (or a similarly efficient construct) to get batches of rows where the value of each row is within 100 of the previous row.

In that case the batches produced from the example above would be

100, 200, 300

500, 600, 650

1000

Is this possible to do in Pandas? Since Pandas attempts to allow for SQL-like queries, I am guessing that it should be.

The Unfun Cat

1 Answer


You can use an approach similar to that described in the answer to this question. It's basically a three-step process:

  1. Use shift to compute the row-to-row criterion you want to use to distinguish groups.
  2. Use cumsum on this criterion to create a new Series in which every row of the same group gets the same label.
  3. Group on this new Series.

Here is an example:

>>> import pandas
>>> x = pandas.Series([100, 200, 300, 500, 600, 650, 1000, 900, 750])
>>> x.groupby(((x - x.shift()).abs() > 100).cumsum()).apply(list)
0    [100, 200, 300]
1    [500, 600, 650]
2        [1000, 900]
3              [750]
dtype: object

Note that I used the criterion > 100, which is the opposite of the <= 100 criterion you mentioned. With this approach, you need to use the criterion for separating groups, not the criterion for joining them, so you have to use the negation of your grouping criterion.
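For reference, here is roughly what the intermediate Series from steps 1 and 2 look like for the same x (shown as they would appear in the interpreter; the exact integer dtype may vary by platform):

>>> (x - x.shift()).abs() > 100               # step 1: True where a new group should start
0    False
1    False
2    False
3     True
4    False
5    False
6     True
7    False
8     True
dtype: bool
>>> ((x - x.shift()).abs() > 100).cumsum()    # step 2: running count of boundaries = group labels
0    0
1    0
2    0
3    1
4    1
5    1
6    2
7    2
8    3
dtype: int64

Grouping x by that second Series (step 3) then produces exactly the groups shown above.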

BrenBarn