Select random rows from duplicate IDS

Question

I'm dealing with a dataset of students' ratings of teachers. Some students rated the same teacher more than once. I would like to subset the data using the following criteria:

1) Keep all unique student IDs and their ratings.

2) Where a student rated a teacher more than once, keep only one rating, chosen at random.

3) If possible, I'd like to run the code in a munging script at the top of every analysis file and ensure that the resulting dataset is exactly the same for each analysis (set seed?).

# data
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id,teacher.id,rating)

Thanks for any guidance for how to move forward.

lmo · Accepted Answer · 2016-07-21T19:11:29.300

Assuming that each student.id rates only one teacher, you could use the following method.

# get a list containing data.frames for each student
myList <- split(df, df$student.id)

# sample one row from each data.frame that has more than one observation,
# otherwise keep the single observation, then bind the results into a data.frame
set.seed(1234)
do.call(rbind, lapply(myList, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))

This returns

  student.id teacher.id rating
1          1          1    100
2          2          1     89
3          3          1     99
4          4          2     87
5          5          2     24
6          6          2     52
7          7          2     99
8          8          2     79
9          9          2     12
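
For the third requirement in the question (an identical dataset on every run), the steps above could be wrapped in a small function that sets the seed before sampling. This is only a sketch: `sample_one_per_student` and its `seed` argument are hypothetical names, not part of the original answer. The rows drawn for a given seed can also differ between R versions, since the default sampling algorithm changed in R 3.6.0, so the printed result above may not match on a newer installation.

```r
# Example data from the question
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id, teacher.id, rating)

# Hypothetical helper: keep one randomly chosen row per student.id
sample_one_per_student <- function(data, seed = 1234) {
  set.seed(seed)  # fixing the seed makes the random choice repeatable
  groups <- split(data, data$student.id)
  picked <- lapply(groups, function(x) {
    if (nrow(x) > 1) x[sample(nrow(x), 1), ] else x
  })
  do.call(rbind, picked)
}

first  <- sample_one_per_student(df)
second <- sample_one_per_student(df)

identical(first, second)         # TRUE: same seed, same rows
anyDuplicated(first$student.id)  # 0: every student appears once
```

Calling set.seed inside the function resets the global random number stream as a side effect, which matters if later code in the same script also draws random numbers.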

If the same student.id rates multiple teachers, then this method requires the construction of a new variable with the interaction function:

# create new interaction variable
df$stud.teach <- interaction(df$student.id, df$teacher.id)

myList <- split(df, df$stud.teach)

The remainder of the code is then identical to that above.
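
Put together, the multi-teacher variant might look like the following sketch. The extra row (student 1 also rating teacher 2) is hypothetical data added purely for illustration, and `drop = TRUE` is an addition not in the code above; it omits student/teacher combinations that never occur, so split does not create empty data.frames.

```r
# Example data from the question
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id, teacher.id, rating)

# Hypothetical extra row: student 1 also rates teacher 2
df <- rbind(df, data.frame(student.id = 1, teacher.id = 2, rating = 75))

# One factor level per student/teacher pair that actually occurs
df$stud.teach <- interaction(df$student.id, df$teacher.id, drop = TRUE)
myList <- split(df, df$stud.teach)

set.seed(1234)
result <- do.call(rbind, lapply(myList, function(x) {
  if (nrow(x) > 1) x[sample(nrow(x), 1), ] else x
}))

nrow(result)  # 10: one row per student/teacher pair
```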

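A potentially faster method is to use the data.table package and its rbindlist function. The code below assumes that the stud.teach variable created above already exists in df.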

library(data.table)
# convert into a data.table
setDT(df)

myList <- split(df, df$stud.teach)

# put together data.frame with rbindlist
rbindlist(lapply(myList, function(x) if(nrow(x) > 1) x[sample(nrow(x), 1), ] else x))
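
If data.table is not an option, a vectorised base R alternative is to shuffle all rows once and keep the first row seen for each student/teacher pair. This is a different technique from the split-and-sample method above, offered only as a sketch; `shuffled`, `key` and `result` are illustrative names. Because the row order is a uniform random permutation, the first row encountered in each group is a uniformly random member of that group.

```r
# Example data from the question
student.id <- c(1,1,2,3,3,4,5,6,7,7,7,8,9)
teacher.id <- c(1,1,1,1,1,2,2,2,2,2,2,2,2)
rating <- c(100,99,89,100,99,87,24,52,100,99,89,79,12)
df <- data.frame(student.id, teacher.id, rating)

set.seed(1234)

# Put the rows in random order
shuffled <- df[sample(nrow(df)), ]

# Keep the first occurrence of each student/teacher pair
key <- shuffled[, c("student.id", "teacher.id")]
result <- shuffled[!duplicated(key), ]

# Restore a readable ordering
result <- result[order(result$student.id, result$teacher.id), ]
```

This avoids building one small data.frame per ID, which is usually where the split/lapply/rbind approach spends its time.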

what would change if a student rated multiple teachers? i can update my data. — bfoste01, Jul 21 '16 at 18:45
The split would have to be on a variable that interacts the teacher and the student ids. See my updated answer. — lmo, Jul 21 '16 at 18:47
Fantastic. That helps a lot! Is there a way to speed that code up? I have 100,000 IDS, so it is quite slow to converge to a solution in the final do.call or is this as fast as it gets? — bfoste01, Jul 21 '16 at 19:04
I added a `data.table` method that may speed things up a bit. — lmo, Jul 21 '16 at 19:12

score 0 · Answer 2 · answered Feb 11 '22 at 14:43

This can now be done much faster using data.table. Your question is equivalent to sampling rows within groups; see:

Sample random rows within each group in a data.table
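
Grouped sampling of this kind is commonly written in data.table with `.SD` and `.N`. A minimal sketch on the question's data, assuming the data.table package is installed; `dt`, `result` and `result2` are illustrative names, and the code is not taken from the linked question.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  student.id = c(1,1,2,3,3,4,5,6,7,7,7,8,9),
  teacher.id = c(1,1,1,1,1,2,2,2,2,2,2,2,2),
  rating     = c(100,99,89,100,99,87,24,52,100,99,89,79,12)
)

set.seed(1234)

# Within each student/teacher group, keep one randomly chosen row
result <- dt[, .SD[sample(.N, 1)], by = .(student.id, teacher.id)]

# Variant: sample one row index per group with .I, then subset once
idx <- dt[, .I[sample(.N, 1)], by = .(student.id, teacher.id)]$V1
result2 <- dt[idx]
```

The `.I` variant is often suggested for large tables because it subsets the table a single time instead of once per group; whether it is faster in a given case is worth measuring.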

Michael Lachanski
