group_by edit distance between rows over multiple columns

Question

I have the following data frame.

Input:

class   id  q1  q2  q3  q4
Ali     12  1   2   3   3
Tom     16  1   2   4   2
Tom     18  1   2   3   4
Ali     24  2   2   4   3
Ali     35  2   2   4   3
Tom     36  1   2   4   2

class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions

Requirement:

I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated. For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:

difference: For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
cheating: This column lists any ids from the difference column where the difference was zero (or below some threshold value). This will be a flag to alert the teacher that their student might have cheated.

class   id  q1  q2  q3  q4  difference                  cheating
Ali     12  1   2   3   3   (12,24 = 2), (12,35 = 2)    NA
Tom     16  1   2   4   2   (16,18 = 3), (16,36 = 0)    36
Tom     18  1   2   3   4   (16,18 = 3), (18,36 = 3)    NA
Ali     24  2   2   4   3   (12,24 = 2), (24,35 = 0)    35
Ali     35  2   2   4   3   (12,35 = 2), (24,35 = 0)    24
Tom     36  1   2   4   2   (16,36 = 0), (18,36 = 3)    16

Is it possible to achieve this using dplyr?

dput()

For your convenience, here is the dput() output of the data:

structure(list(class = 
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"), 
id = c(12L, 16L, 18L, 24L, 35L, 36L), 
q1 = c(1L, 1L, 1L, 2L, 2L, 1L), 
q2 = c(2L, 2L, 2L, 2L, 2L, 2L), 
q3 = c(3L, 4L, 3L, 4L, 4L, 4L), 
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")

Any help would be greatly appreciated!

score 1 · Answer 1 · answered Dec 13 '22 at 23:33

You could try clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, cut the tree at the threshold of expected cheating.

In this example I am using the standard dist() function to calculate the differences; the stringdist function may be better, or there may be another option worth trying.

library(dplyr)  #needed for %>%, filter() and arrange() below

df <- structure(list(class = 
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"), 
id = c(12L, 16L, 18L, 24L, 35L, 36L), 
q1 = c(1L, 1L, 1L, 2L, 2L, 1L), 
q2 = c(2L, 2L, 2L, 2L, 2L, 2L), 
q3 = c(3L, 4L, 3L, 4L, 4L, 4L), 
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")

#apply the standard distance function
scores <- hclust(dist(df[ , 3:6]))

plot(scores)

#cut the tree into groups; with integer marks, h=0.1 only groups rows with identical scores
groups <- cutree(scores, h=0.1)

#summary table
summarytable <- data.frame(class= df$class, id =df$id, groupings =groups)

#select groups with 2 or more people in them
suspectgroups <- table(groups)[table(groups) >=2]

potential_cheaters <- summarytable %>% filter(groupings %in% names(suspectgroups)) %>% arrange(groupings) 
potential_cheaters

This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).
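
The per-class split suggested above could be sketched roughly as follows. This is only a minimal base R sketch, not a definitive solution: it assumes that the "absolute difference" in the question is the Manhattan distance (the sum of absolute differences across q1 to q4, which matches the expected output in the question), and pairwise_by_class is a hypothetical helper name, not part of any package.

```r
# Sample data from the question
df <- data.frame(
  class = c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"),
  id    = c(12L, 16L, 18L, 24L, 35L, 36L),
  q1    = c(1L, 1L, 1L, 2L, 2L, 1L),
  q2    = c(2L, 2L, 2L, 2L, 2L, 2L),
  q3    = c(3L, 4L, 3L, 4L, 4L, 4L),
  q4    = c(3L, 2L, 4L, 3L, 3L, 2L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per pair of students within the same class.
# Assumption: "difference" means Manhattan distance over the question columns.
pairwise_by_class <- function(data, cols, threshold = 0) {
  per_class <- lapply(split(data, data$class), function(d) {
    # full distance matrix for this class only
    dm <- as.matrix(dist(as.matrix(d[, cols]), method = "manhattan"))
    # keep each unordered pair once (upper triangle)
    idx <- which(upper.tri(dm), arr.ind = TRUE)
    data.frame(
      class      = rep(d$class[1], nrow(idx)),
      id1        = d$id[idx[, "row"]],
      id2        = d$id[idx[, "col"]],
      difference = dm[idx],
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, per_class)
  rownames(out) <- NULL
  # flag pairs at or below the chosen threshold
  out$cheating <- out$difference <= threshold
  out
}

pairs <- pairwise_by_class(df, cols = c("q1", "q2", "q3", "q4"))
pairs
```

Each row of pairs is one (id1, id2) combination within a class, so students from different classes are never compared. Raising threshold above 0 would also flag near-matches, in the same spirit as adjusting the height in cutree().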
