
I have data which is grouped by 'student_id':

my_data = data.frame(student_id = c(1,1,1,1,1,2,2,2,2,2,3,3,3,3,3),
                     exam_no = c(1,2,3,4,5,1,2,3,4,5,1,2,3,4,5),
                     result = rnorm(15,60,10))


my_data
   student_id exam_no   result
1           1       1 56.60374
2           1       2 55.76655
3           1       3 53.81728
4           1       4 74.82202
5           1       5 34.91834
6           2       1 58.32422
7           2       2 60.38213
8           2       3 49.40390
9           2       4 63.85426
10          2       5 40.32912
11          3       1 69.54969
12          3       2 43.36639
13          3       3 37.97265
14          3       4 52.36436
15          3       5 61.62080

My Question:

For each student, I want to select a set of consecutive rows, with random start and end rows.

For example, keep exams 2-4 for student 1, keep exams 2-5 for student 2, etc.


I thought of the following way to do this:

Create a data frame that contains the number of exams each student takes (in my problem every student takes the same number of exams, but this could differ in the future):

library(dplyr)
counts = my_data %>% group_by(student_id) %>% summarise(counts = n())

# create variables that indicate where to start ("min") and where to end ("max") for each student
counts$min = sample(1:counts$counts, 1)
counts$max = sample(counts$min:counts$counts,1)

From here, I was going to write a loop that selects the rows between the "min" and "max" indices for each student (e.g. my_data[min:max, ]), but the code above gives me warnings and illogical results:

Warning message:
In 1:counts$counts :
  numerical expression has 3 elements: only the first used

Warning messages:
1: In counts$min:counts$counts :
  numerical expression has 3 elements: only the first used
2: In counts$min:counts$counts :
  numerical expression has 3 elements: only the first used

# A tibble: 3 x 4
  student_id counts   min   max
       <dbl>  <int> <int> <int>
1          1      5     4     5
2          2      5     4     5
3          3      5     4     5

I am not sure how to continue this - can someone please show me how to continue?

Thanks!

stats_noob
  • Do you always need multiple rows per group? Student 1 has 5 rows. In your attempt, if the min row was 5 for student 1, there are no more rows left to sample. Is that okay? Or do you need the method to get at least 2 rows per student? What distribution of the number of rows per student would you like? If we continue with your method (randomly pick a start and end row), you will tend to pick more rows from students with more data. – Gregor Thomas Feb 19 '23 at 19:05
  • Alternatively, you could (a) pick a start row and then (b) pick a number of rows to sample from that start row. That would let you easily do something like "pick 3, 4, or 5 consecutive rows from each student", though depending on how concerned you are about bias, it could prefer later rows to earlier rows... – Gregor Thomas Feb 19 '23 at 19:06
  • Related: [select two random and consecutive rows from grouped data](https://stackoverflow.com/questions/52546036/select-two-random-and-consecutive-rows-from-grouped-data) (disclaimer: I apparently also answered it... ;); [selecting n randomly sampled consecutive rows across all levels of a factor](https://stackoverflow.com/questions/23836875/selecting-n-randomly-sampled-consecutive-rows-across-all-levels-of-a-factor-with), but with a fixed sample size. – Henrik Feb 19 '23 at 19:37
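For context, the warnings in the question appear because `:` is not vectorized: `1:counts$counts` uses only the first element of `counts$counts`, so a single start and end value is drawn once and recycled to every student. Below is a minimal base R sketch of the per-student version of the same idea (random start, then random end at or after it). The helper name `pick_run` is made up for illustration, and the data are regenerated with an arbitrary seed, so the values will not match the question's output.

```r
set.seed(1)  # arbitrary seed, only to make the sketch reproducible
my_data <- data.frame(
  student_id = rep(1:3, each = 5),
  exam_no    = rep(1:5, times = 3),
  result     = rnorm(15, 60, 10)
)

# pick_run() is a hypothetical helper: it takes the rows of ONE student
# and returns a random block of consecutive rows
pick_run <- function(d) {
  n     <- nrow(d)
  start <- sample.int(n, 1)                         # random start row
  end   <- start + sample.int(n - start + 1L, 1) - 1L  # random end row >= start
  d[start:end, , drop = FALSE]
}

# apply the helper separately to each student, then stack the pieces
picked <- do.call(rbind, lapply(split(my_data, my_data$student_id), pick_run))
picked
```

`sample.int()` is used instead of `sample(a:b, 1)` because `sample()` treats a single number `k` as `1:k`, which would give wrong results when only one candidate row is left. As noted in the comments, this scheme can return a single row, and it tends to favour later rows.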

2 Answers


A base R option that uses cumsum to label the rows between two randomly sampled positions:

subset(
  my_data,
  ave(
    exam_no,
    student_id,
    FUN = function(x) cumsum(seq_along(x) %in% sample.int(length(x), 2))
  ) == 1
)

which gives, for example

   student_id exam_no   result
2           1       2 61.83643
3           1       3 51.64371
4           1       4 75.95281
6           2       1 51.79532
7           2       2 64.87429
8           2       3 67.38325
11          3       1 75.11781
12          3       2 63.89843
13          3       3 53.78759
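To see how the `cumsum` mask works, here is a small sketch for a single student with the two sampled positions fixed by hand (the values 2 and 4 are stand-ins for what `sample.int(length(x), 2)` might return). As far as I can tell, the mask keeps the rows from the first sampled position up to, but not including, the second one, so the last row of a group is never kept; an inclusive variant is shown at the end for comparison.

```r
x      <- 1:5        # row positions within one student
picked <- c(2, 4)    # stand-in for sample.int(length(x), 2)

marks <- seq_along(x) %in% picked  # FALSE TRUE FALSE TRUE FALSE
run   <- cumsum(marks)             # 0 1 1 2 2
half_open <- x[run == 1]           # positions 2 and 3 (4 is excluded)

# inclusive variant (an assumption about what may be wanted, not part of
# the answer): keep everything from the smaller to the larger position
rng       <- range(picked)
inclusive <- x[seq(rng[1], rng[2])]  # positions 2, 3 and 4
```

Which variant is preferable depends on whether the final exam of each student should be eligible for selection.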

A more compact data.table version of the same idea:

library(data.table)
setDT(my_data)[, .SD[cumsum((1:.N) %in% sample.int(.N, 2)) == 1], student_id]
ThomasIsCoding
  • @ThomasIsCoding: Thank you for your answer! Can you please explain this code? Is it selecting a random consecutive sample from each student? – stats_noob Mar 08 '23 at 15:02
  • @stats_noob Yes, a random consecutive sample from each student. It first generates two random positions; the indices in between are the desired rows. – ThomasIsCoding Mar 08 '23 at 19:13

Using data.table, within each group, sample two values from .I (without replacement), and create a sequence of indices.

library(data.table)
setDT(my_data)

set.seed(3)
my_data[my_data[ , {ix = sample(.I, 2); ix[1]:ix[2]}, by = student_id]$V1]

#   student_id exam_no   result
#         <num>   <num>    <num>
# 1:          1       5 74.05672
# 2:          1       4 49.37525
# 3:          1       3 67.41662
# 4:          1       2 67.64935
# 5:          2       4 55.15337
# 6:          2       3 58.95694
# 7:          3       4 50.79859
# 8:          3       3 53.66886
# 9:          3       2 47.01089
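In the output above, the rows of a student can come out in descending order, because `ix[1]:ix[2]` counts down whenever the first sampled index is the larger one. If the original exam order should be kept, one possible variation (a sketch, not the answer's exact code) is to sort the two sampled positions first. It samples within-group positions via `.N` rather than `.I`, regenerates the data with an arbitrary seed, and assumes every student has at least two rows.

```r
library(data.table)

set.seed(3)  # arbitrary seed; values will differ from the output above
dt <- data.table(
  student_id = rep(1:3, each = 5),
  exam_no    = rep(1:5, times = 3),
  result     = rnorm(15, 60, 10)
)

# per student: sample two distinct positions, sort them, keep the run between
res <- dt[, {
  ix <- sort(sample.int(.N, 2))  # needs .N >= 2, otherwise sample.int() errors
  .SD[ix[1]:ix[2]]
}, by = student_id]
res
```

Using `sample.int(.N, 2)` also sidesteps the case where `sample()` receives a single number and samples from `1:k` instead, which could happen with `sample(.I, 2)` for a one-row group.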
Henrik