calculate median from data.table columns in R

Question

I am trying to calculate a median value across a number of columns, however my data is a bit funky. It looks like the following example.

library(data.table)

dt <- data.table("ID" = c(1,2,3,4),"none" = c(0,5,5,3), 
                 "ten" = c(3,2,5,4),"twenty" = c(0,2,3,1))


   ID none ten twenty
1:  1    0   3      0
2:  2    5   2      2
3:  3    5   5      3
4:  4    3   4      1

In the table to column represents the number of occurrences of that value. I am wanting to calculate the median occurrence.

For example for ID = 1

median(c(10, 10, 10))

is the calculation I am wanting to create.

for ID = 2

median(c(0, 0, 0, 0, 0, 10, 10, 20, 20))

I have tried using rep() and lapply() with very limited success and am after some clear guidance on how this might be achieved. I understand for the likes of rep() I would be having to hard code my value to be repeated (e.g. rep(0,2) or rep(10,2)) and this is what I expect. I am just struggling to create a list or vector with the repetitions from each column.

The code to create the `dt` in your example doesn't match the description of the problem. It's only a typo but confused me initially because all answers differ from your expected result. For ID = 2 the numbers for `"none"` and `"twenty"` are interchanged. With your code you will get for ID = 2 `median(0,0,0,0,0,10,10,20,20)`. — Uwe, Jun 02 '16 at 07:08
@Uwe Okay, good catch. I've tried to fix it, so that the displayed stuff matches the example input. I doubt the OP would mind edits if there are problems; they can always roll them back if we're wrong about it. — Frank, Jun 02 '16 at 18:31

score 16 · Accepted Answer · answered Jun 01 '16 at 22:43

16

Here's another data.table way (assuming unique ID):

dt[, median(rep(c(0, 10, 20), c(none, ten, twenty))), by=ID]
#    ID V1
# 1:  1 10
# 2:  2  0
# 3:  3 10
# 4:  4 10

This is just an attempt to get @eddi's answer without reshaping (which I tend to use as a last resort).

answered Jun 01 '16 at 22:43

Arun

116,683
26
284
387

Nice clean single line of code which is what i was trying to achieve. I have to hardcode my values so this best answers the question, although I can also see how it would be suitable if my values were in a column too. – Dan Jun 03 '16 at 00:51

score 12 · Answer 2 · answered Jun 01 '16 at 21:55

You need a dictionary to translate column names to corresponding numbers, and then it's fairly straightforward:

dict = data.table(name = c('none', 'ten', 'twenty'), number = c(0, 10, 20))

melt(dt, id.var = 'ID')[
  dict, on = c(variable = 'name')][, median(rep(number, value)), by = ID]
#   ID V1
#1:  1 10
#2:  2  0
#3:  3 10
#4:  4 10

score 6 · Answer 3 · edited May 23 '17 at 12:16

6

Here's a way that avoids by-row operations and reshaping:

dt[, m := {
    cSD  = Reduce(`+`, .SD, accumulate=TRUE)
    k    = floor(cSD[[length(.SD)]]/2)

    m    = integer(.N)
    for(i in seq_along(cSD)) {
        left = m == 0L
        if(!any(left)) break
        m[left] = i * (cSD[[i]][left] >= k[left])
    }
    names(.SD)[m]
}, .SDcols=none:twenty]

which gives

   ID none ten twenty    m
1:  1    0   3      0  ten
2:  2    5   2      2 none
3:  3    5   5      3  ten
4:  4    3   4      1  ten

For the loop, I'm borrowing @alexis_laz' style, e.g. https://stackoverflow.com/a/30513197/

I've skipped translation of the column names, but that's pretty straightforward. You could use c(0,10,20) instead of names(.SD) at the end.

edited May 23 '17 at 12:16

Community

1
1

answered Jun 02 '16 at 00:20

Frank

66,179
8
96
180

The `.SDcols = none:twenty` is nifty. Didn't know you could do that. Also, what is `.N`? – CJB Jun 02 '16 at 08:50
2

@Bazz Yeah, that shortcut for .SDcols is a fairly recent addition. `.N` refers to the number of rows in the table, or, if there's a `by=` clause (like in Arun's answer), it refers to the number of rows in the by-group. – Frank Jun 02 '16 at 12:02

Psidom · Answer 4 · 2016-06-01T23:21:22.253

Here is a rowwise dplyr way:

dt %>% rowwise %>% 
       do(med = median(c(rep(0, .$none), rep(10, .$ten), rep(20, .$twenty)))) %>%  
       as.data.frame
  med
1  10
2   0
3  10
4  10

Inspired by @Arun's answer, this is also working:

dt %>% group_by(ID) %>% 
       summarise(med = median(rep(c(0, 10, 20), c(none, ten, twenty))))

Source: local data table [4 x 2]

     ID   med
  (dbl) (dbl)
1     1    10
2     2     0
3     3    10
4     4    10

calculate median from data.table columns in R

4 Answers4