I have a data frame with 22239 rows and 200 columns. The first column, NAME, is character and the other columns are numeric. My goal is to transform every element of each row by:

  • Finding the row's median;
  • Subtracting the median from each element (value) of the row;
  • Finding the row's median absolute deviation (mad);
  • Dividing the row's elements by the row's mad.

I tried it this way:

edata <- read.delim("a.txt", header=TRUE, sep="\t")

## Converting dataframe into Matrix
## Taking all rows, columns 2 to 200 (dropping NAME)
data <- as.matrix(edata[,2:200])
for(i in 1:nrow(data)){  # loop over rows
    m <- median(data[i,])  # median of row i, taken before the row is modified
    md <- mad(data[i,])    # mad of row i
    for(j in 1:ncol(data)) {  # loop over columns
        a <- data[i,j]  # assigning matrix element value to a
        subs <- a-m    # subtracting the median
        escore <- subs/md  # final score
        data[i,j] <- escore  # assigning final score to row elements
    }
}

After getting the new values for every element of the rows, I want to sort the result according to the 75% quantiles on the basis of the NAME column, but I am not sure how to do this.

I know my code isn't memory efficient. When I run the above code, the looping is very slow. I tried foreach, but couldn't get it to work. Can you suggest a good way to deal with this kind of problem?

csgillespie
thchand
    I would suggest you make a new question for your sorting problem. Try to follow the MRE guidelines posted here: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Roman Luštrik May 27 '11 at 15:10

5 Answers

This is an ideal job for sweep().

set.seed(47)
dat <- matrix(rnorm(22239 * 200), ncol = 200)
rmeds <- apply(dat, 1, median)     ## row medians
rmads <- apply(dat, 1, mad)        ## row mads
dat2 <- sweep(dat, 1, rmeds, "-")  ## sweep out the medians
dat2 <- sweep(dat2, 1, rmads, "/") ## sweep out the mads

This can be sped up a bit by not using mad(), as it computes the medians again:

rmeds <- apply(dat, 1, median)     ## row medians
dat3 <- sweep(dat, 1, rmeds, "-")  ## sweep out the medians
rmads <- 1.4826 * apply(abs(dat3), 1, median)        ## row mads
dat3 <- sweep(dat3, 1, rmads, "/") ## sweep out the mads

R> all.equal(dat2, dat3)
[1] TRUE

Notice that R's mad() multiplies by a constant 1.4826 to achieve asymptotically normal consistency, hence the extra bit in the second example.
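
As a quick check of that constant, here is a minimal sketch comparing mad() with the hand-computed version. The vector x is made-up illustrative data, not the asker's.

```r
set.seed(1)
x <- rnorm(101)                                   # illustrative data only

raw_mad    <- median(abs(x - median(x)))          # unscaled median absolute deviation
scaled_mad <- 1.4826 * raw_mad                    # scaling that mad() applies by default

# Both comparisons should hold if mad() is defined as described above
check_default <- isTRUE(all.equal(mad(x), scaled_mad))
check_raw     <- isTRUE(all.equal(mad(x, constant = 1), raw_mad))
```

Passing constant = 1 to mad() returns the unscaled value, which may be what you want if normal consistency is not needed.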

Some timings on my system:

## first version
   user  system elapsed 
  6.215   0.183   6.412 

## second version
   user  system elapsed 
  4.365   0.167   4.535 

For @Nick's Answer I get:

## @Nick's Version
   user  system elapsed 
  5.900   0.032   5.955

which is consistently faster than my first version, but a little slower than the second version, again because the medians are being computed twice.
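
Timings of this kind can be reproduced with system.time(). The following is a rough sketch on a smaller matrix; the 2000-row size and the object names are arbitrary choices for illustration, and the numbers will differ from machine to machine.

```r
set.seed(47)
small <- matrix(rnorm(2000 * 200), ncol = 200)   # smaller than the real data, for a quick run

## first version: mad() recomputes the row medians internally
t_first <- system.time({
    meds <- apply(small, 1, median)
    mads <- apply(small, 1, mad)
    out1 <- sweep(sweep(small, 1, meds, "-"), 1, mads, "/")
})

## second version: reuse the centred data to get the mads
t_second <- system.time({
    meds    <- apply(small, 1, median)
    centred <- sweep(small, 1, meds, "-")
    mads    <- 1.4826 * apply(abs(centred), 1, median)
    out2    <- sweep(centred, 1, mads, "/")
})

## elapsed seconds for each version
print(c(first = t_first[["elapsed"]], second = t_second[["elapsed"]]))
```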

Gavin Simpson

How about this: (I created another matrix to start from, but the method is the same)

dta<-matrix(rnorm(200), nrow=20)
dta.perrow<-apply(dta, 1, function(currow){c(med=median(currow), mad=mad(currow))})
result<-(dta - dta.perrow[1,])/dta.perrow[2,]

I'm sure there are still better ways, but HTH.
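
The subtraction and division work because R recycles a vector down the columns of a matrix, so a vector with one entry per row lines up with the rows. A small sketch to check this against sweep() follows; the sweep() comparison and the name viasweep are additions for illustration, not part of the snippet above.

```r
set.seed(1)
dta <- matrix(rnorm(200), nrow = 20)
dta.perrow <- apply(dta, 1, function(currow) c(med = median(currow), mad = mad(currow)))

# Recycling: element [i, j] is paired with the i-th median and the i-th mad
result <- (dta - dta.perrow[1, ]) / dta.perrow[2, ]

# Same computation written with sweep(), for comparison
viasweep <- sweep(sweep(dta, 1, dta.perrow["med", ], "-"), 1, dta.perrow["mad", ], "/")
same <- isTRUE(all.equal(result, viasweep))
```

Note that this shortcut relies on the vector having exactly one entry per row; for column-wise statistics you would need sweep() with MARGIN = 2 or a transpose.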

Nick Sabbe

R, like MATLAB, is optimised for vector operations. Your for loops are probably the slowest way of achieving this. The median of each row can be calculated using the apply function rather than a for loop. This will give you a vector of medians, one per row (leave out the character NAME column first), e.g.

apply(edata[, -1], 1, median)

Similar approaches can be used for the other measures. Remember, avoiding for loops in R/matlab will generally speed up your code.
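
For instance, here is a minimal sketch of the same idea extended to the mad and the final score. The data frame toy is a made-up stand-in for edata (one character column followed by numeric columns), and all object names are hypothetical.

```r
# Toy stand-in for edata: a NAME column plus 8 numeric columns
set.seed(1)
toy <- data.frame(NAME = paste0("g", 1:5), matrix(rnorm(5 * 8), nrow = 5))

num     <- as.matrix(toy[, -1])        # drop NAME before doing arithmetic
row_med <- apply(num, 1, median)       # one median per row
row_mad <- apply(num, 1, mad)          # one mad per row
scores  <- (num - row_med) / row_mad   # vectors recycle down columns, i.e. per row
```

After this, every row of scores should have median 0 and mad 1, which is a handy sanity check.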

Matt

There are special functions for dealing with row data, but I like to use apply. You can think of apply as a for loop (which it essentially is) working on one row at a time.

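my.m <- matrix(runif(100), ncol = 5)
my.median <- apply(X = my.m, MARGIN = 1, FUN = median) #1 row medians
my.centered <- my.m - my.median #2 subtract each row's median
my.mad <- apply(X = my.m, MARGIN = 1, FUN = mad) #3 row mads
my.centered/my.mad #4 divide the centered values by each row's mad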
Roman Luštrik

You can put all the steps in a function and use only one apply loop.

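rfun <- function(x) {
         me <- median(x)
         ## reuse the median; constant=1 gives the raw mad (no 1.4826 scaling)
         md <- mad(x, center=me, constant=1)
         return((x-me)/md)}

## apply() returns one column per row of dat, so transpose back
dat_s <- t(apply(dat,1,rfun))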
Wojciech Sobala