group_by row wise distance between multiple columns

Question

Posting an updated question from my last post as I think my requested output was a bit too complicated group_by edit distance between rows over multiple columns. Here, I have simplified the desired output.

I have the following data frame.

Input:

class   id  q1  q2  q3  q4
Ali     12  1   2   3   3
Tom     16  1   2   4   2
Tom     18  1   2   3   4
Ali     24  2   2   4   3
Ali     35  2   2   4   3
Tom     36  1   2   4   2

class indicates the teacher's name,
id indicates the student user ID, and,
q1, q2, q3 and q4 indicate marks on different test questions

Requirement:

I am interested in finding potential cases of cheating. I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated. I realise it's a very naive assumption - I have just made up this example so that I can use this code in another context.

For example, looking at the class Tom, the IDs 16 and 36 have the same scores across the columns q1, q2, q3 and q4, which means that their score difference is zero (the difference between corresponding entries is zero).

Therefore, I want to calculate the row-wise sum of the absolute difference between multiple columns where the rows are grouped by class name. Based on this absolute difference, I want to include additional columns such as diff0, diff1, diff2 and diff3:

diff0: For a given ID, this column lists all the other id's that have a pairwise difference of zero with this student, given that they belong to the same class.
diff1: For a given ID, this column lists all the other id's that have a pairwise difference of one with this student, given that they belong to the same class. and lastly,
diff2: For a given ID, this column lists all the other id's that have a pairwise difference of three with this students, given that they belong to the same class etc.

My desired output is as below:

class   id  q1  q2  q3  q4  diff0   diff1   diff2   diff3
Ali     12  1   2   3   3   NA      NA      24,35   NA
Tom     16  1   2   4   2   36      NA      NA      18
Tom     18  1   2   3   4   NA      NA      NA      16,36
Ali     24  2   2   4   3   35      NA      12      NA
Ali     35  2   2   4   3   24      NA      12      NA
Tom     36  1   2   4   2   16      NA      NA      18

Is it possible to achieve this using dplyr?

dput()

For your convenience, I am sharing data dput():

structure(list(class = 
c("Ali", "Tom", "Tom", "Ali", "Ali", "Tom"), 
id = c(12L, 16L, 18L, 24L, 35L, 36L), 
q1 = c(1L, 1L, 1L, 2L, 2L, 1L), 
q2 = c(2L, 2L, 2L, 2L, 2L, 2L), 
q3 = c(3L, 4L, 3L, 4L, 4L, 4L), 
q4 = c(3L, 2L, 4L, 3L, 3L, 2L)), row.names = c(NA, -6L), class = "data.frame")

Any help on this would be greatly appreciated!

*I hypothesise that if the students are in the same class and have similar scores on different questions, they are likely to have cheated*...A bit strong/unfair assumption :) — Sotos, Dec 14 '22 at 09:11
I agree about the assumption being strong, but in reality, I am not using this code in this context. I just made it up to make my requirements simpler and understandable :) — Sandy, Dec 14 '22 at 09:14

Stefano Barbi · Accepted Answer · 2022-12-15T08:49:42.963

A possible solution using dist to calculate distances with the manhattan method.

library(dplyr)
library(tidyr)

df |>
    group_by(class) |>
    group_modify(~{
        ## prepare the dataframe for pairwise dist calculation using `dist'
        ## add rownames to dataframe so that results can be easily
        ## delt with after
        tibble::column_to_rownames(.x, "id") |>
            select(q1:q4) |>
            dist(method = "manhattan") |>
            as.matrix() |>
            ## transform the distance matrix into a <<long>> dataframe
            ## consisting of pairwise comparisons between 
            ## `id' and `other'
            as_tibble(rownames = "id") |>
            pivot_longer(-id,
                         names_to = "other",
                         values_to = "diff") |>
            ## filter out comparisons of `id' with itself 
            filter(id != other) |>
            ## expand the dataframe as the cartesian product of 
            ## its rows and the wanted `cutoff's (0:3)   
            crossing(cutoff = 0:3) |>
            group_by(cutoff,id) |>
            ## for each cutoff and id combine the values in `other`
            ## that match cutoff and transform to the wanted <<wide>> format              
            summarize(other = paste(other[diff == cutoff], collapse = ",")) |>
            pivot_wider(names_from = cutoff,
                        values_from = other,
                        names_prefix = "diff")            
    }) |>
    ## this step is needed so that `id' has the same type of the 
    ## original dataframe
    mutate(id = as.integer(id)) |>
    right_join(df)

##> # A tibble: 6 × 10
##> # Groups:   class [2]
##>   class    id diff0 diff1 diff2   diff3      q1    q2    q3    q4
##>   <chr> <int> <chr> <chr> <chr>   <chr>   <int> <int> <int> <int>
##> 1 Ali      12 ""    ""    "24,35" ""          1     2     3     3
##> 2 Ali      24 "35"  ""    "12"    ""          2     2     4     3
##> 3 Ali      35 "24"  ""    "12"    ""          2     2     4     3
##> 4 Tom      16 "36"  ""    ""      "18"        1     2     4     2
##> 5 Tom      18 ""    ""    ""      "16,36"     1     2     3     4
##> 6 Tom      36 "16"  ""    ""      "18"        1     2     4     2

Thank you for the solution, could you please explain/elaborate it? — Sandy, Dec 15 '22 at 05:36
It seems to be working for my original data. Thank you! Once you have explained your code, I will accept your solution. Many many thanks :) — Sandy, Dec 15 '22 at 06:08
@Sandy I added comments in the code because, being a long 'one-liner' in the style of tidyverse, it is quite unnatural to break it down into sections. Anyway, i hope that it is clearer now! — Stefano Barbi, Dec 15 '22 at 09:06

group_by row wise distance between multiple columns

Input:

Requirement:

Related posts:

dput()

1 Answers1