Trying to convert the following R data.frame:
structure(list( Time=c("09:30:01" ,"09:30:29" ,"09:35:56", "09:37:17" ,"09:37:21" ,"09:37:28" ,"09:37:35" ,"09:37:51" ,"09:42:11" ,"10:00:31"),
Price=c(1,2,3,4,5,6,7,8,9,10),
Volume=c(100,200,300,100,200,300,100,200,600,100)),
.Names = c("Time", "Price", "Volume"),
row.names = c(NA,10L),
class = "data.frame")
Time Price Volume
1 09:30:01 1 100
2 09:30:29 2 200
3 09:35:56 3 300
4 09:37:17 4 100
5 09:37:21 5 200
6 09:37:28 6 300
7 09:37:35 7 100
8 09:37:51 8 200
9 09:42:11 9 600
10 10:00:31 10 100
into this
Time Price Volume Bin
1 09:30:01 1 100 1
2 09:30:29 2 200 1
3 09:35:56 3 200 1
4 09:35:56 3 100 2
5 09:37:17 4 100 2
6 09:37:21 5 200 2
7 09:37:28 6 100 2
8 09:37:28 6 200 3
9 09:37:35 7 100 3
10 09:37:51 8 200 3
11 09:42:11 9 500 4
12 09:42:11 9 100 5
13 10:00:31 10 100 5
Essentially, it is calculating cumulative sums on volume and binning eachtime 500 is breached. So, bin 1 is 100+200+200 with the volume at 09:35:56 split into 200/100 and a new row inserted and the bin counter incremented.
This is relatively straightforward with base R but I was wondering is there a more elegant and hopefully faster way with dplyr.
Cheers
Update:
Thank you @Frank and @AntoniosK.
To address your question the range of volume values is all positive integers values from 1 to 10k.
I microbenchmarked both approaches and dplyr was slightly faster but not much in it, on a dataset similar to above with ~200k rows.
Really appreciate the swift responses and assistance