I have a fairly large data.table (15M rows, 15 columns) for which I want to calculate the median of each row. I can do this using

apply(DT, 1, median)  # DT is my data.table

but this is very slow. Is there a faster, data.table-friendly alternative?

As a small working example, if I have

DT = data.table(a = c(1, 2, 4), b = c(6, 4, 7), 
                c = c(3, 9, 9), d = c(18, 1, -5))
#    a b c  d
# 1: 1 6 3 18
# 2: 2 4 9  1
# 3: 4 7 9 -5

what is the most efficient way of computing the row medians?

apply(DT, 1, median)
# [1] 4.5 3.0 5.5
Kevin Gori

1 Answer


One option is to use the rowMedians function from the matrixStats package:

library(matrixStats)
DT[, med := rowMedians(as.matrix(.SD))][]

which gives:

> DT
   a b c  d med
1: 1 6 3 18 4.5
2: 2 4 9  1 3.0
3: 4 7 9 -5 5.5
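
If the call may be re-run, or the table also holds non-numeric columns, it can help to name the value columns explicitly so that an existing med column is not fed back into the calculation. The following is a minimal sketch under that assumption; value_cols is a hypothetical helper name, and na.rm = TRUE is an optional choice that the original call does not use:

```r
library(data.table)
library(matrixStats)

DT <- data.table(a = c(1, 2, 4), b = c(6, 4, 7),
                 c = c(3, 9, 9), d = c(18, 1, -5))

# Hypothetical helper: the columns that should enter the row median.
value_cols <- c("a", "b", "c", "d")

# .SDcols restricts .SD to those columns; na.rm = TRUE (an assumption,
# not part of the original answer) skips missing values within each row.
DT[, med := rowMedians(as.matrix(.SD), na.rm = TRUE), .SDcols = value_cols]

print(DT)
```

Note that as.matrix(.SD) copies the selected columns into a matrix, so memory use on a 15M-row table is worth keeping in mind.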

Or with only data.table:

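DT[, med := melt(DT, measure.vars = names(DT))[
  , r := seq_len(.N), by = variable][   # r = original row number within each column
  , median(value), by = r]$V1][]        # median of all values sharing a row number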
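
Before switching approaches on the full table, it may be worth confirming on a small sample that both give the same result as apply. A rough sketch of such a check on the example data follows; the variable names baseline, fast and long are illustrative only:

```r
library(data.table)
library(matrixStats)

DT <- data.table(a = c(1, 2, 4), b = c(6, 4, 7),
                 c = c(3, 9, 9), d = c(18, 1, -5))
cols <- names(DT)  # captured before any 'med' column is added

# Baseline from the question: row-wise apply.
baseline <- unname(apply(DT, 1, median))

# matrixStats approach.
fast <- rowMedians(as.matrix(DT[, ..cols]))

# data.table-only approach: reshape to long form, tag each value with
# its original row number, then take the median per row number.
long <- melt(DT, measure.vars = cols)[
  , r := seq_len(.N), by = variable][
  , median(value), by = r]$V1

# Both alternatives should agree with the baseline.
stopifnot(isTRUE(all.equal(baseline, fast)),
          isTRUE(all.equal(baseline, long)))
```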
Jaap