Fastest way to select i-th highest value from row and assign to new column

Question

I'm looking for a solution to add a new column to an existing dataframe / datatable which is the i-th highest value from each individual row. For example, if I want the 4th highest value, the new column will contain 1.9 for the first row.

data <- data.frame(a = c("a","a","b","b","c","a"),
                   peak1 = c(1.1,2.5,2.4,2.1,2.5,2.6),
                   peak2 = c(1.2,2.5,2.4,2.1,2.5,2.6),
                   peak3 = c(1.3,2.5,2.4,2.1,2.5,2.6),
                   peak4 = c(1.4,2.5,2.5,2.1,2.5,2.6),
                   peak5 = c(1.5,2.5,2.46,2.1,2.5,2.6),
                   peak6 = c(1.6,2.5,2.4,2.1,2.5,2.6),
                   peak7 = c(1.7,2.5,2.4,2.1,2.5,2.0),
                   peak8 = c(1.8,2.5,2.4,2.1,2.5,2.1),
                   peak9 = c(1.9,2.2,2.4,2.1,2.5,2.2),
                   peak10 = c(2,2.5,2.4,2.1,2.5,2.3),
                   peak11 = c(2.1,2.5,2.4,2.1,2.5,2.4),
                   peak12 = c(2.2,2.5,2.4,2.99,3,2.5))

I've tried adding an index column and then select the value using a lapply function, but it returns a list in each cell and runs very slow on the real dataset, which has ~3.000.000 records. Ideally, i'm looking for a solution that solves this in a few seconds because this runs in shiny.

data$index <- lapply(split(data[,c(-1)],seq(nrow(data))),FUN = order, decreasing = TRUE)
rank <- 4
data$result <- lapply(1:nrow(data), function(row) data[row, data$test[[row]][rank]+1])

Martin Morgan · Answer 1 · 2019-10-24T13:09:41.673

I've updated my answer to provide three solutions; fun2() is in retrospect the best (fastest, most robust, easy to understand) answer.

There are various StackOverflow posts for finding n-th highest values, e.g., https://stackoverflow.com/a/2453619/547331 . Here's a function to implement that solution

nth <- function(x, nth_largest) {
    n <- length(x) - (nth_largest - 1L)
    sort(x, partial=n)[n]
}

Apply this to each (numerical) row of your data.frame

data$nth <- apply(data[,-1], 1, nth, nth_largest = 4)

I made a large data set

for (i in 1:20) data = rbind(data, data)

and then did some basic timing

> system.time(apply(head(data[,-1], 1000), 1, nth, 4))
   user  system elapsed
  0.012   0.000   0.012
> system.time(apply(head(data[,-1], 10000), 1, nth, 4))
   user  system elapsed
  0.150   0.005   0.155
> system.time(apply(head(data[,-1], 100000), 1, nth, 4))
   user  system elapsed
  1.274   0.005   1.279
> system.time(apply(head(data[,-1], 1000000), 1, nth, 4))
   user  system elapsed
 14.847   0.095  14.943

So it scales linearly with number of rows (not surprising...), at about 15s per million rows.

For comparison, I wrote this solution as

fun0 <-
    function(df, nth_largest)
{
    n <- ncol(df) - (nth_largest - 1L)
    nth <- function(x)
        sort(x, partial=n)[n]
    apply(df, 1, nth)
}

used as fun0(data[,-1], 4).

A different strategy is to create a matrix from the numerical data

m <- as.matrix(data[,-1])

then to order the entire matrix, placing the row indexes of the values into order

o <- order(m)
i <- row(m)[o]

Then for the largest, next largest, ... values, set the last value of each row index to NA; the nth largest value is then the last occurrence of the row index

for (iter in seq_len(nth_largest - 1L))
    i[!duplicated(i, fromLast = TRUE)] <- NA_integer_
idx <- !is.na(i) & !duplicated(i, fromLast = TRUE)

The corresponding values are m[o[idx]], placed in row-order with

m[o[idx]][order(i[idx])]

Thus an alternative solution is

fun1 <-
    function(df, nth_largest)
{
    m <- as.matrix(df)
    o <- order(m)
    i <- row(m)[o]

    for (idx in seq_len(nth_largest - 1L))
        i[!duplicated(i, fromLast = TRUE)] <- NA_integer_
    idx <- !is.na(i) & !duplicated(i, fromLast = TRUE)

    m[o[idx]][order(i[idx])]
}

We have

> system.time(res0 <- fun0(head(data[,-1], 1000000), 4))
   user  system elapsed 
 17.604   0.075  17.680 
> system.time(res1 <- fun1(head(data[,-1], 1000000), 4))
   user  system elapsed 
  3.036   0.393   3.429 
> identical(unname(res0), res1)
[1] TRUE

Generally, it seems like fun1() will be faster when nth_largest is not too large.

For fun2(), order the original data by row and then value, and keep only the relevant indexes

fun2 <-
    function(df, nth_largest)
{
    m <- as.matrix(df)
    o <- order(row(m), m)
    idx <- seq(ncol(m) - (nth_largest - 1), by = ncol(m), length.out = nrow(m))
    m[o[idx]]
}

With

> system.time(res1 <- fun1(head(data[, -1], 1000000), 4))
   user  system elapsed 
  2.948   0.406   3.355 
> system.time(res2 <- fun2(head(data[, -1], 1000000), 4))
   user  system elapsed 
  0.316   0.062   0.379 
> identical(res1, res2)
[1] TRUE

Profiling fun2() on the full data set

> dim(data)
[1] 6291456      13
> Rprof(); res2 <- fun2(data[, -1], 4); Rprof(NULL); summaryRprof()
$by.self
              self.time self.pct total.time total.pct
"order"            1.50    63.56       1.84     77.97
"unlist"           0.36    15.25       0.36     15.25
"row"              0.34    14.41       0.34     14.41
"fun2"             0.10     4.24       2.36    100.00
"seq.default"      0.06     2.54       0.06      2.54
...

shows that most of the time is spent in order(); I'm not completely sure how order() on multiple factors is implemented, but it perhaps has the complexity associated with radix sort. Whatever the case, it's pretty fast!

fwiw, `fun2` is faster than using `Rfast::nth` on each row. and can i presume that the complexity is bounded by radix sort's O(wn)? — chinsoon12, Oct 24 '19 at 10:22
@chinsoon12 I added a little profiling information for `fun2()` and changed my answer so that I don't make an explicit statement about complexity. — Martin Morgan, Oct 24 '19 at 13:10
`m[order(row(m), m)]` is actually an incredible fast way to sort each row of the matrix without any external packages. can i have your permission to post this solution in https://stackoverflow.com/questions/9506442/fastest-way-to-sort-each-row-of-a-large-matrix-in-r and https://stackoverflow.com/questions/6063881/sorting-rows-alphabetically and then reference solutions back to here? — chinsoon12, Oct 25 '19 at 01:05
@chinsoon12 sure! it's really pretty amazing to see how flexible that is; I guess one needs `matrix(m[order(row(m), m)], nrow(m), byrow=TRUE)` for row-wise sorting, but only `m[] = m[order(col(m), m)]` for column-wise sorting. — Martin Morgan, Oct 25 '19 at 11:00

Fastest way to select i-th highest value from row and assign to new column

1 Answers1

Linked