
I have a list object with 4983 rows and 369 columns. Each column is a different sample, and each row holds one value of that sample.

Now I need to extract the 100 samples (columns) whose values have the highest variance, but I have no idea how to do this.

marc_s
Mark Wekking
  • What have you tried so far? Can you post data? – Mike Nov 11 '19 at 14:25
  • Can you help us help you by producing a reproducible example? You can find all the means to do so [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – cbo Nov 11 '19 at 14:43
  • Try something like `sort(sapply(df[,target_col], var), decreasing=TRUE)[1:100]` – slava-kohut Nov 11 '19 at 14:45

2 Answers

Example using only 20 rows and 5 columns, returning the two columns that have the highest variability:

# some example data:
dat <- data.frame(var1 = rnorm(n=20, mean = 1, sd=4),
                  var2 = rnorm(n=20, mean = 1, sd=3),
                  var3 = rnorm(n=20, mean = 1, sd=2),
                  var4 = rnorm(n=20, mean = 1, sd=8),
                  var5 = rnorm(n=20, mean = 1, sd=6))
head(dat)

# calculate variance per column
variances <- apply(X=dat, MARGIN=2, FUN=var)

# order the columns by variance (highest first) and keep the indices of the top 2
top_idx <- head(order(variances, decreasing=TRUE), 2) # replace 2 with 100 ...

# use that to subset the original data (drop=FALSE keeps a data frame even for a single column)
dat.highvariance <- dat[, top_idx, drop=FALSE]
dat.highvariance
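
Scaled up to the dimensions mentioned in the question, the same steps might look like the sketch below. The data is simulated, and the names `samples`, `top_idx` and `top100` are only illustrative assumptions; `sapply()` is used here because it also works if the input is a plain list of numeric vectors rather than a data frame.

```r
# Simulated stand-in for the asker's data: 4983 rows x 369 columns.
# The column names and the normal distribution are assumptions for illustration.
set.seed(42)
n_rows <- 4983
n_cols <- 369
samples <- as.data.frame(matrix(rnorm(n_rows * n_cols), nrow = n_rows))
names(samples) <- paste0("sample", seq_len(n_cols))

# variance of each column (one value per sample)
variances <- sapply(samples, var)

# column positions ordered from highest to lowest variance; keep the first 100
top_idx <- head(order(variances, decreasing = TRUE), 100)

# drop = FALSE keeps a data frame even if only one column is selected
top100 <- samples[, top_idx, drop = FALSE]
dim(top100)
```

The result keeps all 4983 rows and only the 100 selected columns, ordered from highest to lowest variance.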
Where's my towel

My code does exactly the same as the answer by "Where's my towel", but it is faster because it uses the Rfast package.

# some example data:
dat <- data.frame(var1 = rnorm(n=20, mean = 1, sd=4),
                  var2 = rnorm(n=20, mean = 1, sd=3),
                  var3 = rnorm(n=20, mean = 1, sd=2),
                  var4 = rnorm(n=20, mean = 1, sd=8),
                  var5 = rnorm(n=20, mean = 1, sd=6))

# base R version, for comparison
MaxVars_R <- function(dat, n) {
    # calculate variance per column
    variances <- apply(X=dat, MARGIN=2, FUN=var)

    # sort variance, grab index of the first n
    sorted <- sort(variances, decreasing=TRUE, index.return=TRUE)$ix[1:n]

    # use that to subset the original data
    dat[, sorted]
}

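# Rfast version
MaxVars <- function(dat, n, parallel = FALSE) {
    # Rfast works on matrices, so convert the data frame first
    x <- Rfast::data.frame.to_matrix(dat)
    # variance of every column
    variances <- Rfast::colVars(x, parallel = parallel)
    # indices of the n columns with the largest variance
    indices <- Rfast::Order(variances, descending = TRUE, partial = n)[1:n]
    dat[, indices]
}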

all.equal(MaxVars(dat,2),MaxVars_R(dat,2))
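
To check the speed difference on your own machine, a rough timing sketch could look like the following. The matrix size mirrors the question, the data is simulated, and the variable names are only illustrative; the Rfast part runs only if the package is installed, and no particular timings are claimed here.

```r
# Simulated data with the dimensions from the question (an assumption for illustration)
set.seed(1)
big <- as.data.frame(matrix(rnorm(4983 * 369), nrow = 4983))

# base R: apply() over the columns
t_apply <- system.time(v_apply <- apply(big, 2, var))
cat("apply elapsed (s):", t_apply[["elapsed"]], "\n")

# Rfast: only attempted when the package is available
if (requireNamespace("Rfast", quietly = TRUE)) {
    m <- as.matrix(big)
    t_rfast <- system.time(v_rfast <- Rfast::colVars(m))
    cat("Rfast elapsed (s):", t_rfast[["elapsed"]], "\n")
    # both approaches should agree up to floating-point tolerance
    print(all.equal(unname(v_apply), unname(v_rfast)))
}
```

A single `system.time()` call is noisy, so repeating the measurement a few times gives a more reliable picture.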
Manos Papadakis