1
  • I have data that includes student scores in their mid-terms and their final exams.

  • This data is arranged in a wide format, where each row corresponds to a unique student ID shown as SUID.

  • My data also includes information about teachers, which is shown by TUserId. Each teacher can have multiple students, hence there is a common teacher ID across several rows.

  • I am interested in finding out if there are any instances where a teacher who gave similar scores to their students on mid-terms (as shown by mid_sum) gave inconsistent scores to their students on their final exams (as shown by final_sum). To record such inconsistency, I want to add a column Status that records this mismatch or inconsistency.

Input:

My data df looks something like this: :

TUserId  SUID   mid_sum final_sum
 115      201   7       1
 115      309   8       2
 115      404   9       1
 209      245   10      2
 209      398   10      2
 209      510   10      2
 209      602   10      1
 371      111   11      2
 371      115   11      2
 371      123   11      3
 371      124   11      2

Output:

For my output, I require something like this: :

TUserId  SUID   mid_sum final_sum   Status
 115      201   7       1           consistent
 115      309   8       2           consistent
 115      404   9       1           inconsistent
 209      245   10      2           consistent
 209      398   10      2           consistent
 209      510   10      2           consistent
 209      602   10      1           inconsistent
 371      111   11      2           consistent
 371      115   11      2           consistent
 371      123   11      3           inconsistent
 371      124   11      2           consistent
 
Requirements:

My requirements are as below:

Mid Term Score Requirements:

1- Students cannot have higher final grades (relative to each other) when their midterm scores were lower. For example, student SUID = 309 has higher final grade when their mid-term score was lower than the student SUID = 404. In such a case I want to flag SUID = 404 as inconsistent.

2- Students with similar mid-term scores also cannot have different final grades. For example, student SUID = 602 has lower final grade when their mid-term score was equal to the other students of the teacher TUserId = 209. Similarly, the student SUID = 123 has higher final grade when their mid-term score was equal to the other students of the teacher TUserId = 371.

Final Score Requirements:

1- However, the same final score can be assigned to students with different mid-term scores. I realise that this requirement is a bit confusing. As long as the mid-term scores stay the same or they increase, the final grades can stay the same. But not the other way around i.e., if the mid-term score starts decreasing within that teacher then the final grade cannot stay the same.

2- Moreover, if the mid-term scores are increasing, the final scores can also increase (or stay the same as the previous value).

Data import dput()

The dput() for the data frame is below:

dput(df)

structure(list(
TUserId = c(115L, 115L, 115L, 209L, 209L, 209L, 209L, 371L, 371L, 371L, 371L), 
SUID = c(201L, 309L, 404L, 245L, 398L, 510L, 602L, 111L, 115L, 123L, 124L), 
mid_sum = c(7L, 8L, 9L, 10L, 10L, 10L, 10L, 11L, 11L, 11L, 11L),
final_sum = c(1L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 3L, 2L)), 
class = "data.frame", row.names = c(NA, -11L))
Note:
  • I have students' mid_sum and final_sum scores sorted in ascending order. I just want to identify the cases where there is an inconsistent assignment of marks.

  • From an implementation perspective, comparisons are always made to the previous value.

  • I am re-posting my question as the last example did not clarify my exact requirements Identify cases where data sequence changes based on other column UserIDs.

Partial Solution:

The following solution is partially meeting my requirements but does not capture the cases where the students with similar mid-term scores have a higher final score.

library(dplyr)
df %>%
  arrange(TUserId, mid_sum) %>%
  group_by(TUserId) %>%
  mutate(
    Status = if_else(
      sign(final_sum - lag(final_sum, default = 0) + lead(final_sum, default = 0)) 
      == sign(mid_sum - lag(mid_sum, default = 0)  + lead(mid_sum, default = 0)),
      "consisent", "inconsistent"
    )
  )
Sandy
  • 1,100
  • 10
  • 18

1 Answers1

1

Pretty cool Problem. Your Question is well explained.

Consider this code:

## Rule 1
# we sort by mid sum first, then final sum
# if the cumululative max of the final sum is higher than the current finalsum,
# the mid sum had to be lower

df <- df %>% arrange(mid_sum,final_sum) %>% mutate(inconsistentRule1 = cummax(final_sum)>final_sum)


# Rule 2
# This is a shot in the dark as the inconsitency criteria is a bit fuzzy
# (What if a teacher with only two students on same mid_level assigns different grades,
# which  student is to be considered "inconsistent"? The lower or the higher graded)
# i just used the median, in this case students that deviate from the norm
# are considered the inconsistent ones, works with your example
df <- df %>% group_by(TUserId,mid_sum) %>% mutate(inconsistentRule2= final_sum != median(final_sum))

# combine the rules
df <- df %>% ungroup() %>% 
  mutate(Status=ifelse(
    inconsistentRule1 | inconsistentRule2,
    "inconsistent",
    "consistent"))

# put in order and delete working columns
df %>% arrange(TUserId,SUID) %>%
  select(-c("inconsistentRule1","inconsistentRule2"))

The result is your desired table

Sandwichnick
  • 1,379
  • 6
  • 13
  • Thank you for this solution, I will try it on my original data. I like how different steps are easy for me to understand and follow. – Sandy Jun 16 '21 at 22:25