
Is there a fast way to randomly sample N hours for each day in a multi-year, multi-indexed, hourly data set using the pandas tools? My goal is to get N random hours for each day and each X,Y pair.

If my data looked like so:

In [21]: df
Out[21]:
                            Stuff
Date                X Y
2004-01-01 02:00:00 0 1  1.047065
2004-01-01 03:00:00 0 1 -1.048725
2004-01-01 04:00:00 0 1 -0.245098
2004-01-01 05:00:00 0 1  0.452306
2004-01-01 02:00:00 2 3  0.100935
2004-01-01 03:00:00 2 3 -1.183009
2004-01-01 04:00:00 2 3  0.164260
2004-01-01 05:00:00 2 3 -1.013031
2004-01-01 02:00:00 4 2 -0.300900
2004-01-01 03:00:00 4 2  0.698377
2004-01-01 04:00:00 4 2  0.335517
2004-01-01 05:00:00 4 2 -0.421466
2004-01-01 02:00:00 7 9 -0.904358
2004-01-01 03:00:00 7 9  1.496770
2004-01-01 04:00:00 7 9 -0.966784
2004-01-01 05:00:00 7 9  0.101442
2004-01-02 02:00:00 0 1  0.771495
2004-01-02 03:00:00 0 1 -1.559194
2004-01-02 04:00:00 0 1  0.497352
2004-01-02 05:00:00 0 1  0.377913
2004-01-02 02:00:00 2 3  0.637454
2004-01-02 03:00:00 2 3 -0.381010
2004-01-02 04:00:00 2 3  1.973359
2004-01-02 05:00:00 2 3  0.390250
2004-01-02 02:00:00 4 2  0.948655
2004-01-02 03:00:00 4 2  0.234342
2004-01-02 04:00:00 4 2  0.766474
2004-01-02 05:00:00 4 2 -0.529767
2004-01-02 02:00:00 7 9  0.682759
2004-01-02 03:00:00 7 9  2.202768
2004-01-02 04:00:00 7 9  2.190237
2004-01-02 05:00:00 7 9 -1.641499

I would hope to get a result along these lines (for N = 2):

                            Stuff
Date                X Y
2004-01-01 02:00:00 0 1  1.047065
2004-01-01 05:00:00 0 1  0.452306
2004-01-01 04:00:00 2 3  0.164260
2004-01-01 05:00:00 2 3 -1.013031
2004-01-01 02:00:00 4 2 -0.300900
2004-01-01 03:00:00 4 2  0.698377
2004-01-01 02:00:00 7 9 -0.904358
2004-01-01 05:00:00 7 9  0.101442
2004-01-02 03:00:00 0 1 -1.559194
2004-01-02 04:00:00 0 1  0.497352
2004-01-02 04:00:00 2 3  1.973359
2004-01-02 05:00:00 2 3  0.390250
2004-01-02 02:00:00 4 2  0.948655
2004-01-02 05:00:00 4 2 -0.529767
2004-01-02 04:00:00 7 9  2.190237
2004-01-02 05:00:00 7 9 -1.641499

1 Answer


Update: You changed your question to group by X and Y as well as time. To use a TimeGrouper (as I do below, in my answer to your original question) along with other grouping criteria (e.g., ['X', 'Y']), see this answer.
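For the updated question (N random hours per day and per X,Y pair), here is a minimal sketch rather than a definitive solution. It assumes a pandas version that provides `GroupBy.sample` (1.1 or later, as far as I know) and rebuilds data shaped like the question's with made-up random values. The names `day`, `x`, `y`, and `sampled` are illustrative only. Instead of a TimeGrouper, it groups on the calendar day derived from the `Date` level together with the `X` and `Y` levels.

```python
import numpy as np
import pandas as pd

N = 2  # number of hours to draw per day and per (X, Y) pair

# Rebuild a frame shaped like the one in the question (values are made up).
hours = pd.to_datetime(
    [f"2004-01-{d:02d} {h:02d}:00" for d in (1, 2) for h in (2, 3, 4, 5)]
)
pairs = [(0, 1), (2, 3), (4, 2), (7, 9)]
index = pd.MultiIndex.from_tuples(
    [(ts, x, y) for ts in hours for x, y in pairs],
    names=["Date", "X", "Y"],
)
rng = np.random.default_rng(0)
df = pd.DataFrame({"Stuff": rng.standard_normal(len(index))}, index=index)

# Grouping keys: the calendar day (timestamp truncated to midnight), X, and Y.
day = df.index.get_level_values("Date").normalize()
x = df.index.get_level_values("X")
y = df.index.get_level_values("Y")

# Draw N rows without replacement from every (day, X, Y) group.
# random_state is fixed here only to make the sketch reproducible.
sampled = df.groupby([day, x, y]).sample(n=N, random_state=0).sort_index()
print(sampled)
```

Each (day, X, Y) group must contain at least N rows, otherwise sampling without replacement raises an error.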

Group hourly, and use transform with this answer like so:

df.groupby(pd.TimeGrouper('H')).transform(lambda x: x[random.sample(x.index, N)])

Example: I generate a data set with multiple samples per hour, and I randomly choose two from each hour.

In [62]: df = DataFrame(np.random.randn(6), pd.date_range(freq='20T', start=pd.datetime.now(), periods=6))

In [63]: df
Out[63]: 
                            0
2013-10-08 14:18:49  0.709713
2013-10-08 14:38:49  1.413776
2013-10-08 14:58:49 -0.725483
2013-10-08 15:18:49  1.251557
2013-10-08 15:38:49 -1.049705
2013-10-08 15:58:49  1.100699

In [65]: df.groupby(pd.TimeGrouper('H')).transform(lambda x: x[random.sample(x.index, 2)])
Out[65]: 
                            0
2013-10-08 14:18:49  0.709713
2013-10-08 14:58:49 -0.725483
2013-10-08 15:38:49 -1.049705
2013-10-08 15:58:49  1.100699

I used the built-in module random. Version 1.7 of NumPy adds numpy.random.choice for the same functionality, which I assume is somewhat faster.
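`pd.TimeGrouper` was later deprecated and removed from pandas, so the session above may not run on a current installation. The following is a hedged sketch of the same hourly example, assuming a pandas version that provides `GroupBy.sample` (1.1 or later, as far as I know). The column name `value` and the variables `hour` and `sampled` are made up for illustration, and the start time is fixed so the example is repeatable.

```python
import numpy as np
import pandas as pd

N = 2  # rows to draw from each hour

# Six observations, 20 minutes apart, spanning two clock hours.
index = pd.date_range("2013-10-08 14:18:49", periods=6, freq="20min")
rng = np.random.default_rng(0)
df = pd.DataFrame({"value": rng.standard_normal(6)}, index=index)

# Truncate each timestamp to the start of its hour and use that as the group key.
hour = df.index.floor("h")

# Draw N rows without replacement from each hourly group.
# random_state is fixed here only to make the sketch reproducible.
sampled = df.groupby(hour).sample(n=N, random_state=0).sort_index()
print(sampled)
```

Unlike the transform-based version, this returns only the sampled rows and keeps their original timestamps as the index.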

Dan Allan
  • Dan, thanks for your response. I believe it is getting at what I want. However, I am about to change my question to get a view on my larger problem. Stay tuned... – Cloudwalker Oct 08 '13 at 18:32