Rolling comparison (per row) between two columns

Question

I am trying to write a code that extracts the ID of the teacher who assigned the same mid_term score to their students but then gave different scores on their finals.

While I'm interested in identifying the case where mid_term scores were the same for different final_scores, I am not interested in the converse relation i.e., it doesn't matter to me if the same final_scores are given for different mid_term scores.

Input:

My input df looks something like this: :

TUserId  SUID   mid_sum final_sum
 115      201   7       1
 115      309   7       2
 209      245   10      2
 209      398   10      2
 209      510   10      3
 302      423   8       1
 302      456   8       1
 412      189   6       1
 412      191   7       1

Output:

In the output I am just looking for a way to have the following flags with the teacher IDs

TUserId  Flag
 115     inconsistent
 209     inconsistent
 302     consistent
 412     consistent

OR

TUserId  Flag
 115     TRUE
 209     TRUE
 302     FALSE
 412     FALSE

Requirements:

When the value of mid_sum is the same across different students (shown by SUID) of the same teacher (shown by TUserId), it puts inconsistent in an additional column. I'm mainly looking for extracting the teacher ID where such discrepancy appeared.

This is related to the following two posts but one of the solutions meet my requirements.

Rolling computation to identify mismatch between two columns and Finding if a value is within the range of other columns

Any help on this would be greatly appreciated.

Why is SUID 398 `consistent`? it has the same `mid_sum` and `TUserId` as 245. — r2evans, Jul 15 '21 at 00:28
That's a good question. It doesn't matter actually which one we call consistent or inconsistent. I just want to be able to pick up teacher Ids (TUserId) that showed this trend. Let me update my example. — Sandy, Jul 15 '21 at 00:30
Whoa, that's a really big change to your sample data. And inconsistent. Why are `TUserId` 302's results all consistent with identical `mid_sum`? — r2evans, Jul 15 '21 at 00:36
I mean, I am just interested in picking the teacher Id, so it really doesn't matter which entry is labelled consistent or inconsistent as long as there is an anomaly. — Sandy, Jul 15 '21 at 00:38
Perhaps I misunderstand. I thought that if two (or more) students with the same teacher have the same `mid_sum`, then some (in your first sample data) or all (in this sample data) of those students are flagged as "inconsistent". However, teacher 302 has two students with a `mid_sum` of 8 who are flagged as "consistent". Where am I going wrong? — r2evans, Jul 15 '21 at 00:42
I apologise for the confusion. I have further clarified my question, does it address your concerns? — Sandy, Jul 15 '21 at 00:45

r2evans · Accepted Answer · 2021-07-15T01:22:22.837

Third attempt :-)

do.call(rbind, by(dat, dat$TUserId, FUN = function(z) {
  data.frame(
    TUserId = z$TUserId[1],
    Flag = any(table(unique(subset(z, select=c("mid_sum","final_sum")))$mid_sum) > 1)
  )
}))
#     TUserId  Flag
# 115     115  TRUE
# 209     209  TRUE
# 302     302 FALSE
# 412     412 FALSE

Walk-through, inside-out:

unique(subset(..)) reduces the number of unique combinations of mid_sum and final_sum; the premise that either they should both be the same or both be different is the key here;
table(unique(..)$mid_sum) counts the number of different values; since we've reduced all of the mid/final pairs, if we see a mid_sum value that occurs more than once, it means that we have something wrong here;
any(..) will return a single TRUE for this one teacher (since we're grouping ... see by below) if any of the rows show inconsistency, we don't care which one;
data.frame(..) is just a convenience for combining afterwards, there are certainly different ways to deal with the expected output;
by(dat, dat$TUserId, ..) groups the data by each teacher, and passes to the anonymous function as its z parameters; the first time this function is called, all it sees as z is dat[1:2,] (id 115); the second time this anon-func is called, all it sees as z is dat[3:5,] (id 209); etc;
do.call(rbind, ..): the return value from by is a list, and because of the data.frame(.) above, it is a list of frames; the best and fastest way to combine this list of frames into a single frame is to do this trick. There are other similar and/or better methods out of base R (e.g., dplyr::bind_rows and data.table::rbindlist).

Thank you for the solution. Let me try it on my main data set. — Sandy, Jul 15 '21 at 00:34
Also, the generated solution is slightly incorrect, I want the first two teacher IDs to be TRUE and the last two IDs to be FALSE. — Sandy, Jul 15 '21 at 01:05
Fixed, you editing your question to stress "different" was finally seen by my eyes. — r2evans, Jul 15 '21 at 01:22
Thank you for the detailed explanation. It seems to be working well, I just ran on partial data :) — Sandy, Jul 15 '21 at 01:28

Rolling comparison (per row) between two columns

Input:

Output:

Requirements:

1 Answers1