0

I have the following data frame.

Input:

class   id  q1  q2  q3  q4
Ali     12  1   2   3   3
Tom     16  1   2   4   2
Tom     18  1   2   3   4
Ali     24  2   2   4   3
Ali     35  2   2   4   3
Tom     36  1   2   4   2
  1. class indicates the teacher's name,
  2. id indicates the student user ID, and,
  3. q1, q2, q3 and q4 indicate marks on different test questions

Requirement:

I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated. For this, I want to calculate absolute distance or difference, grouped by class name, across multiple columns, i.e., all the test questions q1, q2, q3 and q4. And I want to store this information in a couple of new columns as below:

  1. difference: For a given class name, it contains the pairwise distance or difference with all other students' id. For a given class name, it stores the information as (id1, id2 = difference)
  2. cheating: This column lists any id's based on the previously created new column where the difference was zero (or some threshold value). This will be a flag to alert the teacher that their student might have cheated.
class   id  q1  q2  q3  q4  difference                  cheating
Ali     12  1   2   3   3   (12,24 = 2), (12,35 = 2)    NA
Tom     16  1   2   4   2   (16,18 = 3), (16,36 = 0)    36
Tom     18  1   2   3   4   (16,18 = 3), (18,36 = 3)    NA
Ali     24  2   2   4   3   (12,24 = 2), (24,35 = 0)    35
Ali     35  2   2   4   3   (12,35 = 2), (24,35 = 0)    24
Tom     36  1   2   4   2   (16,36 = 0), (18,36 = 3)    16

Is it possible to achieve this using dplyr?

Related posts:

I have tried to look for related solutions but none of them address the exact problem that I am facing e.g.,

  1. This post calculates the difference between all pairs of rows. It does not incorporate the group_by situation plus the solution is extremely slow: R - Calculate the differences in the column values between rows/ observations (all combinations)

  2. This one compares only two columns using stringdist(). I want my solution over multiple columns and with a group_by() condition: Creating new field that shows stringdist between two columns in R?

  3. The following post compares the initial values in a column with their preceding values: R Calculating difference between values in a column

  4. This one compares values in one column to all other columns. I would want this but done row wise and through group_by(): R Calculate the difference between values from one to all the other columns

dput()

For your convenience, I am sharing data dput():

structure(list(class = 
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"), 
id = c(12L, 16L, 18L, 24L, 35L, 36L), 
q1 = c(1L, 1L, 1L, 2L, 2L, 1L), 
q2 = c(2L, 2L, 2L, 2L, 2L, 2L), 
q3 = c(3L, 4L, 3L, 4L, 4L, 4L), 
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")

Any help would be greatly appreciated!

Sandy
  • 1,100
  • 10
  • 18

1 Answers1

1

You could try to clustering the data, using hclust() for example. Once the relative distances are calculated and mapped, the cut the tree at the threshold of expected cheating.

This example I am using the standard dist() function to calculate differences, the stringdist function may be better or maybe another option is out there to try.

df<- structure(list(class = 
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"), 
id = c(12L, 16L, 18L, 24L, 35L, 36L), 
q1 = c(1L, 1L, 1L, 2L, 2L, 1L), 
q2 = c(2L, 2L, 2L, 2L, 2L, 2L), 
q3 = c(3L, 4L, 3L, 4L, 4L, 4L), 
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")

#apply the standard distance function
scores <- hclust(dist(df[ , 3:6]))

plot(scores)

#divide into groups based on level of matching too closely
groups <- cutree(scores, h=0.1)

#summary table
summarytable <- data.frame(class= df$class, id =df$id, groupings =groups)

#select groups with more than 2 people in them
suspectgroups <- table(groups)[table(groups) >=2]

potential_cheaters <- summarytable %>% filter(groupings %in% names(suspectgroups)) %>% arrange(groupings) 
potential_cheaters

This works for this test case, but for larger datasets the height in the cutree() function may need to be adjusted. Also consider splitting the initial dataset by class to eliminate the chance of matching people between classes (depending on the situation of course).

Dave2e
  • 22,192
  • 18
  • 42
  • 50