Pandas: Randomly sample 5 consecutive rows based on a condition (value in another column)

Question

I have time-series data and want to sample 5 lots of 5 consecutive days. Within each 5-day sample, the value in another column (the agent) must be the same. (The sample data was originally posted as an image and is not reproduced here.)

Previously, when I was happy with non-consecutive days, I'd use the following code:

df.groupby("AGENT").sample(n=5, random_state=1, replace = True)

I want it to be random, so I don't just want to take the index for the first new agent and then the subsequent 4 rows.

@mozway Sorry, what do you mean? You mean add more examples of what I've tried? — YoungboyVBA, Apr 24 '23 at 20:50
No, I meant to not use an image of data: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) — mozway, Apr 24 '23 at 21:31

mozway · Accepted Answer · 2023-04-24T21:23:33.063


One option is to use a custom groupby.apply:

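import numpy as np

n = 5
# Pick a random start so that the n-row window fits inside the group.
# randint's upper bound is exclusive, hence the +1 (otherwise the last
# window is never drawn and a group of exactly n rows raises ValueError).
out = (df.groupby('Agent', group_keys=False)
         .apply(lambda g: g.iloc[(x:=np.random.randint(0, len(g)-n+1)): x+n])
      )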

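If you have Python < 3.8 (no `:=` operator):

import numpy as np

def random_consecutives(g, n):
    # g is one group's sub-DataFrame, passed in by groupby.apply
    # upper bound is exclusive, so +1 keeps the last window reachable
    start = np.random.randint(0, len(g)-n+1)
    return g.iloc[start: start+n]

out = (df.groupby('Agent', group_keys=False)
         .apply(random_consecutives, n=5)
      )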

Example output:

    Agent  Sales (k)        Date
2       1        1.2  21/08/2012
3       1        6.7  22/08/2012
4       1        5.8  23/08/2012
5       1        9.3  24/08/2012
6       1        8.3  25/08/2012
12      2        8.0  06/07/2012
13      2        0.9  07/07/2012
14      2        1.3  08/07/2012
15      2        1.6  09/07/2012
16      2        8.9  10/07/2012
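For a reproducible draw, and for groups that may hold fewer than n rows, a loop over the groups is one possible alternative to `groupby.apply`. The following is only a sketch under assumptions: the helper name `sample_consecutive`, the `demo` frame, and the choice to skip short groups are all hypothetical and not part of the original answer. It also assumes the rows are already sorted by date within each agent and that there are no missing days, so that consecutive rows really are consecutive days.

```python
import numpy as np
import pandas as pd


def sample_consecutive(df, key, n, seed=None):
    """Return one random run of n consecutive rows per group of `key`.

    Groups with fewer than n rows are skipped (an assumption made for
    this sketch; raising an error would be an equally valid choice).
    """
    rng = np.random.default_rng(seed)
    pieces = []
    for _, g in df.groupby(key, sort=False):
        if len(g) < n:
            continue
        # integers() excludes the upper bound, so +1 keeps the last window reachable
        start = rng.integers(0, len(g) - n + 1)
        pieces.append(g.iloc[start:start + n])
    # Return an empty frame with the same columns if no group was long enough
    return pd.concat(pieces) if pieces else df.iloc[0:0]


# Hypothetical data: agents with 8, 5 and 3 daily rows respectively
demo = pd.DataFrame({
    "Agent": [1] * 8 + [2] * 5 + [3] * 3,
    "Date": list(pd.date_range("2012-08-20", periods=8))
            + list(pd.date_range("2012-07-05", periods=5))
            + list(pd.date_range("2012-06-01", periods=3)),
    "Sales (k)": np.arange(16) / 2,
})

out = sample_consecutive(demo, "Agent", n=5, seed=1)
```

Here agent 2 has exactly 5 rows, so its only possible window is the whole group, and agent 3 is dropped because it has only 3 rows. Passing a fixed `seed` makes the selection repeatable, similar in spirit to `random_state` in `DataFrame.sample`.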

edited Apr 24 '23 at 21:23

answered Apr 24 '23 at 20:50

mozway


Thank you. Will read the documentation now. I do get an error using this code though. Invalid syntax for this part: x:= – YoungboyVBA Apr 24 '23 at 21:11
This syntax is only supported by Python ≥ 3.8 (the assignment expression, or "walrus", operator `:=`). Let me rewrite it for older versions. – mozway Apr 24 '23 at 21:20
Have you tested the updated answer? – mozway Apr 25 '23 at 22:51
Yeah, didn't work. Doing some further reading now. Will update as soon as I get somewhere. Sorry for the delay. – YoungboyVBA Apr 25 '23 at 22:52
How did it "*not work*"? Do you have an error? – mozway Apr 25 '23 at 22:57
`g` is the parameter of the `lambda` (or of the named function); `groupby.apply` passes each group to it. You do not need to define it yourself – mozway May 03 '23 at 14:05
Thank you. Works perfectly now. Really appreciate it. Will buy you a coffee on the commute to work later. Any other documentation you'd recommend reading to make sure i fully understand this? – YoungboyVBA May 03 '23 at 14:07
Well, the [pandas documentation](https://pandas.pydata.org/docs) is a gold mine, although here it's mostly custom code. Don't hesitate to ask if you need clarification – mozway May 03 '23 at 14:16
