
I have an object named counts that is a matrix of certain counts of genes. I want to find the 5 highest values for the counts in the whole matrix. I think I know how to find the highest value but I want to find the top 5.

My code for finding the highest value is

a3 <- max(counts)

Can I modify this to find the top 5 highest values in the matrix?

  • Is `counts` a data frame, matrix, a vector or something else? Why is the `as.numeric()` needed? Typically one might `sort()` and then pick the first 5 rows, but we can't help you do that unless we can see your data. `dput(head(counts, 10))` is a nice way to share the first 10 rows/values - it is copy/pasteable and preserves class and structure information. – Gregor Thomas Dec 02 '20 at 19:34
  • It is a tsv file, a matrix, the rows are genes and columns are samples. Yes, I understood now that it is enough to write max(counts). How can I use sort? – Dora Explorer Dec 02 '20 at 19:39
  • General advice: adjust your frame of mind. Maybe you have a tsv file on your computer, but once it's loaded into R it's not a tsv file any more, it's an R object. If your import worked, it doesn't matter where it came from, what matters now is its class and structure as an R object. – Gregor Thomas Dec 02 '20 at 19:41
  • This question seems closely related [to this question](https://stackoverflow.com/q/54303287/5861244). – Benjamin Christoffersen Dec 02 '20 at 22:38

2 Answers

You could use `sort` followed by `tail`. This returns the five largest values in increasing order.

max_vals <- tail(sort(counts),5)
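
If you also need to know which gene and sample each top value belongs to, something along these lines may help. It is only a sketch: the small `counts` matrix and its row and column names are made up for illustration, and it assumes your real `counts` is a numeric matrix with `dimnames`.

```r
# hypothetical 3 x 4 count matrix: rows are genes, columns are samples
counts <- matrix(c(5, 12, 7, 30, 1, 18, 9, 25, 3, 14, 2, 21), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), paste0("sample", 1:4)))

# five largest values, largest first
top5 <- head(sort(counts, decreasing = TRUE), 5)

# linear indices of the five largest values, converted to row/column positions
idx <- order(counts, decreasing = TRUE)[1:5]
pos <- arrayInd(idx, dim(counts))

# one row per top value: which gene, which sample, and the count itself
top5_tbl <- data.frame(gene   = rownames(counts)[pos[, 1]],
                       sample = colnames(counts)[pos[, 2]],
                       count  = counts[idx])
```

`order` returns positions rather than values, so the same indices can be reused to look up the row and column names.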
Mario Niepel

You can use sort with the partial argument. Here is an example:

# simulate data
set.seed(2)
X <- matrix(rnorm(10000), 100)

# use sort with partial
x1 <- -sort(-X, partial = 5)

# gives the same as sorting the whole thing
x2 <- tail(sort(X), 5)
setdiff(x1[1:5], x2)
#R> numeric(0)
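
One thing to keep in mind: `partial = 5` only guarantees that the element at position 5 is in its sorted place and that nothing before it is larger (after negating back), so the first five values are the top five but not necessarily in order. A minimal sketch of getting them in decreasing order, reusing the simulated `X` from above (the name `top5` is just for illustration):

```r
# same simulated data as above
set.seed(2)
X <- matrix(rnorm(10000), 100)

# the first five entries are the five largest values of X, in no guaranteed order
top5 <- -sort(-X, partial = 5)[1:5]

# sorting only these five values is cheap and puts the largest first
top5 <- sort(top5, decreasing = TRUE)
```

Passing `partial = 1:5` instead should also place each of the first five positions correctly, at the cost of a little more work inside `sort`.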

It is a bit faster for the example above:

bench::mark(
  `sort with partial` = -sort(-X, partial = 5),
  `tail + sort`       = tail(sort(X), 5), 
  min_time = 1, max_iterations = 1e6, check = FALSE)
#R> # A tibble: 2 x 13
#R>   expression          min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
#R>   <bch:expr>        <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
#R> 1 sort with partial 171µs  183µs     5281.     274KB    38.0   4592    33      870ms <NULL>
#R> 2 tail + sort       505µs  542µs     1814.     117KB     6.16  1768     6      974ms <NULL>
#R> # … with 3 more variables: memory <list>, time <list>, gc <list>

I am sure, though, that there is a sorting function outside base R that supports partial sorting in decreasing order as well. That should be faster because it avoids the copy made by negating the matrix.

Update

If this is ever a bottleneck for anyone, then an Rcpp solution is:

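#include "Rcpp.h"
#include <algorithm>
#include <stdexcept>
using namespace Rcpp;

// comparator for decreasing order
inline bool rev_comp(double const i, double const j){ 
  return i > j; 
}

// Returns a copy of x whose first k elements are the k largest values in
// decreasing order. The order of the remaining elements is unspecified.
// [[Rcpp::export(rng = false)]]
NumericVector get_k_max(NumericVector x, unsigned const k) {
  if(k < 1 || k > static_cast<std::size_t>(x.size()))
    throw std::invalid_argument("Invalid k");
  
  NumericVector out = clone(x);
  // the range is [begin, end), so every element of out is considered
  std::partial_sort(out.begin(), out.begin() + k, out.end(), rev_comp);
  return out;
}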

Using Rcpp::sourceCpp on a file with the above yields a fast solution:

# we get the same 
x3 <- get_k_max(X, 5)
setdiff(x3[1:5], x2)
#R> numeric(0)

# it is faster
bench::mark(
  Rcpp = get_k_max(X, 5),
  min_time = 1, max_iterations = 1e6, check = FALSE)
#R> # A tibble: 1 x 13
#R>   expression    min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result
#R>   <bch:expr> <bch:> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>
#R> 1 Rcpp       19.2µs 22.9µs    42102.    78.2KB     91.9 30226    66      718ms <NULL>
#R> # … with 3 more variables: memory <list>, time <list>, gc <list>