Posting an updated question from my last post as I think my requested output was a bit too complicated group_by edit distance between rows over multiple columns. Here, I have simplified the desired output.
I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
- class indicates the teacher's name,
- id indicates the student user ID, and,
- q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated. I realise it's a very naive assumption - I have just made up this example so that I can use this code in another context.
For example, looking at the class Tom, the IDs 16 and 36 have the same scores across the columns q1, q2, q3 and q4, which means that their score difference is zero (the difference between corresponding entries is zero).
Therefore, I want to calculate the row-wise sum of the absolute difference between multiple columns where the rows are grouped by class name. Based on this absolute difference, I want to include additional columns such as diff0, diff1, diff2 and diff3:
- diff0: For a given ID, this column lists all the other id's that have a pairwise difference of zero with this student, given that they belong to the same class.
- diff1: For a given ID, this column lists all the other id's that have a pairwise difference of one with this student, given that they belong to the same class. and lastly,
- diff2: For a given ID, this column lists all the other id's that have a pairwise difference of three with this students, given that they belong to the same class etc.
My desired output is as below:
class id q1 q2 q3 q4 diff0 diff1 diff2 diff3
Ali 12 1 2 3 3 NA NA 24,35 NA
Tom 16 1 2 4 2 36 NA NA 18
Tom 18 1 2 3 4 NA NA NA 16,36
Ali 24 2 2 4 3 35 NA 12 NA
Ali 35 2 2 4 3 24 NA 12 NA
Tom 36 1 2 4 2 16 NA NA 18
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions, but none of them addresses the exact problem that I am facing, e.g.,
R - Calculate the differences in the column values between rows/ observations (all combinations), Creating new field that shows stringdist between two columns in R?, R Calculating difference between values in a column, and, R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help on this would be greatly appreciated!