Get Median implicitly in R from 2 columns

Question

I am trying to figure out how to manipulate the data here. The image shows only one course but I have multiple courses and course numbers that range from 2010 to 2017. How should I go about adding a column that shows the median grade for a certain course based on the year, taught and term? We have the number of kids who got a certain grade but not the actual grades. I am expecting that the median grade column would should 11 duplicates for the 11 different grades based on each "taught" variable. Taught can only have two values, either "here" or "there".

I have tried using the aggregate function but this problem doesn't seem like something that can be solved with high-level functions. The db is DBKids in R. I just can't seem to think of a way that can help me with this issue. Thanks!

Edit: Reproducible code

structure(list(sessionYear = c(2010, 2010, 2010, 2010, 2010, 
 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, 
 2010, 2010, 2010, 2010, 2010, 2010), courseNumber = c("20", "20",
 "20", "20", "20", "20", "20", "20", "20", "20", "20", "20", "20",
 "20", "20", "20", "20", "20", "20", "20", "20", "20"),
 courseName = c("KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn",
 "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn",
 "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn",
 "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn", "KidsLearn",
 "KidsLearn", "KidsLearn", "KidsLearn"), Taught = c("There", 
 "Here", "There", "Here", "There", "Here", "There", 
 "Here", "There", "Here", "There", "Here", "There", 
 "Here", "There", "Here", "There", "Here", "There", 
 "Here", "There", "Here"), Term = c("1", "1", "1", "1", "1", 
 "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
 "1", "1", "1", "1"), averageGrade = c(83, 84, 83, 84, 83, 84, 
 83, 84, 83, 84, 83, 84, 83, 84, 83, 84, 83, 84, 83, 84, 83, 84
 ), Grade = c("F", "F", "D", "D", "C3", "C3", "C2", "C2", "C1", 
 "C1", "B3", "B3", "B2", "B2", "B1", "B1", "A3", "A3", "A2", "A2", 
 "A1", "A1"), numberOfKids = c(1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 
 3, 0, 3, 2, 6, 0, 14, 7, 24, 4, 18, 4)), class = "data.frame", row.names = c(NA, 
 -22L), .Names = c("sessionYear", "courseNumber", "courseName", 
 "Taught", "Term", "averageGrade", "Grade", "numberOfKids"))

Hope this helps.

Pick and follow your favorite group/aggregate method from the R-FAQ [How to calculate mean by group?](https://stackoverflow.com/q/11562656/903061), but replace `mean` with `weighted.median` from the `spatstat` package. ([Or any of the other packages that implement it](https://www.rdocumentation.org/search?q=weighted.median&latest=1)). You probably will need to convert the factor grade levels to integer with `as.integer` (make sure the levels are ordered correctly!) — Gregor Thomas, Oct 03 '17 at 20:57
It's easier to help you if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Pictures of data are not very useful. Try a `dput()` or use a sample built in dataset to illustrate your problem. Give the desired output. — MrFlick, Oct 03 '17 at 21:02
@Gregor, I did use that but the problem is that Grade is "character" type. If I find the median number of people let's say 23, how does R know which Grade to assign to the median grade column? Because we are saying that the median is 23, I need to find values of grades A1 to F where 23 represents a Grade, like A2 for instance. — Gaurav Vasudev, Oct 03 '17 at 21:34

score 1 · Answer 1 · answered Oct 03 '17 at 22:28

So each entry numberOfKids is the number of kids who got the corresponding grade in Grade? You can get the median "by hand" by doing

get_median = function(numberOfKids,Grade){
 current_count = 0
 middle = (sum(numberOfKids)+1)/2
 for (i in 1:length(numberOfKids)){
  current_count = current_count+numberOfKids
   #if we're halfway through the class, return the current grade
   if (current_count == middle) return(Grade[i])
   #if we're more than halfway through the class, then decide whether
   #the middle is closer to the current total or the previous 
   if (current_count > middle){
    if ((current_count-middle) > numberOfKids[i]/2) return(Grade[i])
    return(Grade[i-1] } } }

Normally with a median, if there's a "tie", you take the average of the two values, but you can't really take the average of two grades, so you have to decide which one to pick. With this function, if there's a complete tie, it takes the lower grade. You can change that by changing the last ">" to ">=".

There might be a slightly simpler way with `cumsum` and `findInterval`. Nice approach! — Gregor Thomas, Oct 03 '17 at 23:01

score 0 · Accepted Answer · answered Oct 03 '17 at 21:45

First we'll make a factor of grade, making sure the levels are in the right order. This we can convert to numeric so we have numbers to take the median of.

levels(factor(dd$Grade))
# [1] "A1" "A2" "A3" "B1" "B2" "B3" "C1" "C2" "C3" "D"  "F" 

## order seems good

dd$grade_numeric = as.numeric(factor(dd$Grade))

Now we do the median by group, weighted by number of kids, round to the nearest integer and convert back to letter grade.

library(dplyr)
group_by(dd, sessionYear, Taught, Term) %>%
    mutate(med = spatstat::weighted.median(x = grade_numeric, w = numberOfKids),
           med = round(med),
           median_Grade = levels(factor(Grade))[med]) %>%
    print.data.frame
#    sessionYear courseNumber courseName Taught Term averageGrade Grade numberOfKids grade_numeric med median_Grade
# 1         2010           20  KidsLearn  There    1           83     F            1            11   2           A2
# 2         2010           20  KidsLearn   Here    1           84     F            0            11   2           A2
# 3         2010           20  KidsLearn  There    1           83     D            0            10   2           A2
# 4         2010           20  KidsLearn   Here    1           84     D            0            10   2           A2
# 5         2010           20  KidsLearn  There    1           83    C3            1             9   2           A2
# 6         2010           20  KidsLearn   Here    1           84    C3            0             9   2           A2
# 7         2010           20  KidsLearn  There    1           83    C2            1             8   2           A2
# 8         2010           20  KidsLearn   Here    1           84    C2            0             8   2           A2
# 9         2010           20  KidsLearn  There    1           83    C1            0             7   2           A2
# 10        2010           20  KidsLearn   Here    1           84    C1            0             7   2           A2
# 11        2010           20  KidsLearn  There    1           83    B3            3             6   2           A2
# 12        2010           20  KidsLearn   Here    1           84    B3            0             6   2           A2
# 13        2010           20  KidsLearn  There    1           83    B2            3             5   2           A2
# 14        2010           20  KidsLearn   Here    1           84    B2            2             5   2           A2
# 15        2010           20  KidsLearn  There    1           83    B1            6             4   2           A2
# 16        2010           20  KidsLearn   Here    1           84    B1            0             4   2           A2
# 17        2010           20  KidsLearn  There    1           83    A3           14             3   2           A2
# 18        2010           20  KidsLearn   Here    1           84    A3            7             3   2           A2
# 19        2010           20  KidsLearn  There    1           83    A2           24             2   2           A2
# 20        2010           20  KidsLearn   Here    1           84    A2            4             2   2           A2
# 21        2010           20  KidsLearn  There    1           83    A1           18             1   2           A2
# 22        2010           20  KidsLearn   Here    1           84    A1            4             1   2           A2

In this example there are only 2 groups (Term and Year only have one value each), and both of them have a median grade of A2. (Scroll to the right to see the added columns.)

Get Median implicitly in R from 2 columns

2 Answers2