I have a dataframe with two groups, but they don't have any identifiers. The two groups alternate many times, and each group always has the same chunk size. For example, in a dataframe with 100 rows, the first 10 rows belong to group A, the next 6 rows belong to group B, the next 10 belong again to group A, etc. In the snippet below, rows 1 to 10 belong to group A, 11 to 16 belong to group B, 17 to 26 to group A again, and so on.

1,0.001284150523134
2,0.002207901328802
3,0.002915323944762
4,0.003469731891528
5,0.003921566996723
6,0.004299059510231
7,0.004616158548743
8,0.004884272348136
9,0.005112133454531
10,0.005309570115060
11,0.004684340208769
12,0.004182199947536
13,0.003777556587011
14,0.003452226985246
15,0.003190805669874
16,0.002980756806210
17,0.003067432902753
18,0.003176181111485
19,0.003286415245384
20,0.003386073280126
21,0.003470669966191
22,0.003541931044310
23,0.003600175259635
24,0.003642340423539
25,0.003669032361358
26,0.003684990806505
...

How can I split this dataframe in two? Or better, how can I apply a calculation/function to each of these chunks, one at a time?

gfreytag
  • Probably duplicating https://stackoverflow.com/questions/7060272/split-up-a-dataframe-by-number-of-rows – thelatemail Jul 28 '21 at 22:27
  • I already tried it, but my two groups have DIFFERENT chunk sizes and are INTERSPERSED. – gfreytag Jul 28 '21 at 22:33
  • Maybe something like: `dat <- data.frame(id=c(1:32)); cumsum(rep(c(1:10,1:6), length.out=nrow(dat))==1)` ? If you can provide example data that is representative of your real data, then the corner cases could be accounted for. – thelatemail Jul 28 '21 at 22:52
  • and then store ^ as a column/factor/identifier then use split() and then use one of the applys (lapply, mclapply, etc)? – eyy Jul 28 '21 at 23:19

1 Answer

I think you can create a counter using some sequences:

dat <- data.frame(id=c(1:32))

dat$grp <- rep(rep(c(1,2), c(10,6)), length.out=nrow(dat))
dat
#   id grp
#1   1   1
 ...
#10 10   1
#11 11   2
 ...
#16 16   2
#17 17   1
 ...
#26 26   1
#27 27   2
 ...
#32 32   2

Then you can apply whatever function you want within each group via `aggregate`, `by`, `dplyr::group_by`, data.table's `by=`, etc.
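As a minimal base-R sketch of that last step: the `value` column, the `"A"`/`"B"` labels, and the choice of `mean` are assumptions made up for illustration, not part of the original data.

```r
# Hypothetical example data: 32 rows with a made-up numeric `value` column
dat <- data.frame(id = 1:32, value = (1:32) / 10)

# Same repeating 10/6 pattern as above, labelled "A"/"B" instead of 1/2
dat$grp <- rep(rep(c("A", "B"), times = c(10, 6)), length.out = nrow(dat))

# Split into a named list holding one data frame per group
parts <- split(dat, dat$grp)

# Apply a function within each group (here: the mean of `value`)
grp_means <- tapply(dat$value, dat$grp, mean)
```

`parts$A` and `parts$B` are then ordinary data frames, so `lapply(parts, some_function)` should work as well if the calculation needs more than one column.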

thelatemail
  • Just an addendum in case anyone needs it, to identify each chunk with a unique number, replace `c(10,6)` with `rep(c(10,6), times=2)`. – gfreytag Jul 29 '21 at 01:06
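A possible way to number every chunk uniquely without knowing in advance how many times the pattern repeats is to start a new id wherever the group label changes. This is only a sketch and assumes the same 32-row toy data; the `chunk` column name is made up for illustration. Note that `rep()` expects a vector `times` argument to have the same length as the values being repeated, so using `rep(c(10,6), times=2)` as the run lengths would presumably need four values (e.g. `1:4`) rather than `c(1,2)`.

```r
dat <- data.frame(id = 1:32)

# Group label: 10 rows of group 1, then 6 rows of group 2, repeated
dat$grp <- rep(rep(c(1, 2), times = c(10, 6)), length.out = nrow(dat))

# Chunk id: increases by one every time the group label changes
dat$chunk <- cumsum(c(TRUE, diff(dat$grp) != 0))

# One data frame per chunk, ready for lapply()
chunks <- split(dat, dat$chunk)
chunk_sizes <- sapply(chunks, nrow)
```

With this toy data, `chunk` runs from 1 to 4, and `lapply(chunks, some_function)` would apply a calculation to each chunk in turn.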