I have the following data frame.
Input:
class id q1 q2 q3 q4
Ali 12 1 2 3 3
Tom 16 1 2 4 2
Tom 18 1 2 3 4
Ali 24 2 2 4 3
Ali 35 2 2 4 3
Tom 36 1 2 4 2
- class indicates the teacher's name,
- id indicates the student user ID, and,
- q1, q2, q3 and q4 indicate marks on different test questions
Requirement:
I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated. For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:
- difference: For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
- cheating: This column lists any id's based on the previously created new column where the difference was zero (or some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class id q1 q2 q3 q4 difference cheating
Ali 12 1 2 3 3 (12,24 = 2), (12,35 = 2) NA
Tom 16 1 2 4 2 (16,18 = 3), (16,36 = 0) 36
Tom 18 1 2 3 4 (16,18 = 3), (18,36 = 3) NA
Ali 24 2 2 4 3 (12,24 = 2), (24,35 = 0) 35
Ali 35 2 2 4 3 (12,35 = 2), (24,35 = 0) 24
Tom 36 1 2 4 2 (16,36 = 0), (18,36 = 3) 16
Is it possible to achieve this using dplyr?
Related posts:
I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,
This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)
This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?
The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column
This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns
dput()
For your convenience, I am sharing data dput():
structure(list(class =
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
id = c(12L, 16L, 18L, 24L, 35L, 36L),
q1 = c(1L, 1L, 1L, 2L, 2L, 1L),
q2 = c(2L, 2L, 2L, 2L, 2L, 2L),
q3 = c(3L, 4L, 3L, 4L, 4L, 4L),
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")
Any help would be greatly appreciated!