
I have a very large data.frame. What I am trying to do is subtract the row mean of columns 37-2574 from those columns, then divide by the row standard deviation. I then need to multiply columns 1-18 by the (same row) standard deviation. Finally, I need to subtract the row mean of columns 37-2574, multiplied by the rescaled columns 1-18, from columns 19-36. I'm currently trying to do this via a for loop, but it is taking forever. Is there a way to do this with apply, or even a faster for loop? Here's what I have currently:

for (i in 1:nrow(samples)){
  # unlist() gives a plain numeric vector (needed if samples is a data.frame)
  theta.mean <- mean(unlist(samples[i, 37:2574]))
  theta.sd <- sd(unlist(samples[i, 37:2574]))
  samples[i, 37:2574] <- (samples[i, 37:2574] - theta.mean)/ theta.sd
  # then multiply columns 1-18 by SD of theta at each iteration 
  samples[i, 1:18] <- samples[i, 1:18] * theta.sd
  # subtract theta-mean * column 1-18 from columns 19-36
  for (j in 1:18){
    theta.mean.beta <- theta.mean * samples[i, j]
    samples[i, j + 18] <- samples[i, j + 18] - theta.mean.beta
  }
}
Alex
  • Are you sure it isn't supposed to be samples[i, 37:2574] <- (samples[i, 37:2574] - theta.mean[i])/ theta.sd[i] ? – ECII May 03 '15 at 18:32
  • and similarly samples[i, 1:18] <- samples[i, 1:18] * theta.sd[i] ? It would make more sense if I understand you correctly – ECII May 03 '15 at 18:33
  • @ECII - since I'm not storing theta.mean or theta.sd, I just write over them at each iteration of the loop. – Alex May 03 '15 at 18:33

1 Answer


The trick is to use apply() to calculate all the row statistics at once and then to do the operations column-wise, like so:

# calculate the row means and SDs using apply()
theta.means  <-  apply(samples[,37:2574],  # the object to summarize
                       1,                  # summarize over the rows (MARGIN = 1)
                       mean)               # the summary function 
theta.sds  <-  apply(samples[,37:2574],1,sd)

# define a function to apply for each row
standardize  <-  function(x)
    (x - mean(x))/sd(x)
# apply it to each row (MARGIN = 1)
samples[,37:2574]  <-  t(apply(samples[,37:2574],1,standardize))

# subtract theta-mean * column 1-18 from columns 19-36
for (j in 1:18){
    samples[, j] <- samples[,j] * theta.sds
    theta.mean.beta <- theta.means * samples[, j]
    samples[, j + 18] <- samples[, j + 18] - theta.mean.beta
}

Be sure to double-check that this code is equivalent to your original code by taking a subset of rows (e.g. `samples <- samples[1:100,]`) and checking that the results are the same (I would have done this myself, but there wasn't an example dataset posted...).
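Since no example data was posted, one way to run that check is on a small synthetic data.frame. The sketch below is only a stand-in: `toy`, its size, and the column ranges `a`, `b` and `th` are hypothetical replacements for `samples` and for columns 1:18, 19:36 and 37:2574. The loop uses `unlist()` so that each row is handled as a plain numeric vector.

```r
set.seed(1)
# Hypothetical stand-in for `samples`: 50 rows, 12 numeric columns.
# a plays the role of 1:18, b of 19:36, th of 37:2574.
toy <- as.data.frame(matrix(rnorm(50 * 12), nrow = 50))
a <- 1:3; b <- 4:6; th <- 7:12

# Reference: the row-by-row loop from the question
loop.version <- function(df) {
  for (i in 1:nrow(df)) {
    theta <- unlist(df[i, th])
    theta.mean <- mean(theta)
    theta.sd <- sd(theta)
    df[i, th] <- (theta - theta.mean) / theta.sd
    df[i, a] <- unlist(df[i, a]) * theta.sd
    df[i, b] <- unlist(df[i, b]) - theta.mean * unlist(df[i, a])
  }
  df
}

# The apply()-based version from this answer
apply.version <- function(df) {
  theta.means <- apply(df[, th], 1, mean)
  theta.sds <- apply(df[, th], 1, sd)
  df[, th] <- t(apply(df[, th], 1, function(x) (x - mean(x)) / sd(x)))
  for (j in a) {
    df[, j] <- df[, j] * theta.sds
    df[, j + length(a)] <- df[, j + length(a)] - theta.means * df[, j]
  }
  df
}

res.loop <- loop.version(toy)
res.apply <- apply.version(toy)

# Largest absolute difference between the two results; should be near zero
max.diff <- max(abs(as.matrix(res.loop) - as.matrix(res.apply)))
max.diff
```

If `max.diff` is not within floating-point noise of zero, the two versions are not doing the same thing and the column ranges are the first thing to check.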


UPDATE:

Here's a more efficient implementation based on David Arenburg's comments below:

# calculate the row means via rowMeans()
theta.means  <-  rowMeans(as.matrix(samples[,37:2574]))

# redefine SD to be vectorized with respect to rows in the data.frame 
rowSD <- function(x)  
    sqrt(rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)) 

# calculate the row SDs using the vectorized version of SD
theta.sds  <-  rowSD(as.matrix(samples[,37:2574]))
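A quick sanity check of `rowSD()` against `apply(x, 1, sd)` can be sketched as follows. The matrix `x` and its dimensions are made up for illustration. The function relies on `x - rowMeans(x)` recycling the vector of row means down each column of the matrix, and on the usual n - 1 denominator of the sample standard deviation.

```r
set.seed(2)
x <- matrix(rnorm(20 * 8), nrow = 20)   # hypothetical 20 x 8 matrix

rowSD <- function(x)
    sqrt(rowSums((x - rowMeans(x))^2) / (dim(x)[2] - 1))

sd.vectorized <- rowSD(x)
sd.apply <- apply(x, 1, sd)

# Largest absolute difference between the two; should be near zero
max(abs(sd.vectorized - sd.apply))
```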

Now use the fact that when you subtract a vector (x) from a data.frame (df), R recycles the values of x, and when length(x) == nrow(df) the result is the same as subtracting x from each column of df:

 # standardize columns 37 through 2574
 samples[,37:2574] <-  (samples[,37:2574] - theta.means)/theta.sds
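The recycling behaviour is easy to see on a tiny example. The data.frame `df` and vector `x` below are made up for illustration only.

```r
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))
x <- c(1, 2, 3)                 # length(x) == nrow(df)

res <- df - x                   # x is subtracted from every column
res
#   a  b
# 1 0  9
# 2 0 18
# 3 0 27

# Same values built explicitly, one column at a time
by.column <- data.frame(a = df$a - x, b = df$b - x)
```

Recycling runs down the columns, so this only gives a per-row operation when the vector has one value per row. A vector with one value per column would not be applied column by column in the same way.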

Now do similar calculations for columns 1:18 and 19:36:

# rescale columns 1-18 by the row SD of theta
samples[, 1:18] <- samples[, 1:18] * theta.sds
# subtract theta-mean * (already rescaled) columns 1-18 from columns 19-36
samples[, 1:18 + 18] <- samples[, 1:18 + 18] - theta.means * samples[, 1:18]
David Arenburg
Jthorpe
  • Nice! Does the use of `apply` here over more manually-written loopy code actually speed things up with a side-by-side comparison? I'm afraid I lack a copy of R (haven't used it since my university), but knowing that a 1-to-1 translation of a for loop into `apply` actually offers a speed boost would be very useful info. –  May 03 '15 at 18:54
  • Thanks, that was great. @Ike - yes, this code is much faster than the original. The original loop never actually finished (ran for over 1 hour), and this code finished in <1 minute. – Alex May 03 '15 at 18:58
  • @Alex Ah yes, this code is faster but it's also smarter algorithmically. I'm mainly curious about those one-to-one side comparisons, as I may have a very incorrect notion of `apply` (I always thought of it as a shorthand rather than an optimization). –  May 03 '15 at 18:58
  • It should be faster. If you think about `df[i,j] <- 1`, R actually replaces the entire vector `df[,j]` (or equivalently `df[[j]]`) during each iteration of the for loop, not just its i'th element. Using `apply()` makes just one replacement of that vector. For more on performance in R, see Hadley Wickham's [Advanced R](http://adv-r.had.co.nz/Performance.html) – Jthorpe May 03 '15 at 19:01
  • Very nice, and apologies, this info makes my previous answer completely wrong (I really thought 'apply' was a shorthand at best). I removed it in favor of your obviously superior one. –  May 03 '15 at 19:02
  • Why are you calculating row means with `apply`? There is a built-in function for that called `rowMeans`. Row-wise SD can be easily vectorized too. `(x - mean(x))/sd(x)` can also be vectorized. My guess is that the data isn't really big, otherwise this whole `apply` with margin of 1 ensemble would never end. – David Arenburg May 03 '15 at 19:41
  • Also what's `samples[,37:2574] <- t(apply(samples[,37:2574],1,standardize))` for? Isn't it just `(samples[,37:2574] - theta.means)/theta.sds`? – David Arenburg May 03 '15 at 19:53
  • Re: using `apply` over `rowMeans`, the only answer I've got (which is not a great answer) is habit. `(x - mean(x))/sd(x)` is vectorized, but I'm curious how you would apply it row-wise without using the apply family. – Jthorpe May 03 '15 at 19:53
  • `(x - mean(x))/sd(x)` isn't vectorized when used within `apply` with a margin of one. You know `apply` is just a `for` loop, right? Here's an interesting read http://stackoverflow.com/questions/28983292/is-the-apply-family-really-not-vectorized. I also don't understand why you couldn't just do `(samples[,37:2574] - theta.means)/theta.sds`? Isn't it the same? What's `apply` and `t` for? – David Arenburg May 03 '15 at 19:54
  • Re vectorizing `sd` over rows, use this `RowSD <- function(x) { sqrt(rowSums((x - rowMeans(x))^2)/(dim(x)[2] - 1)) }` and then simply `RowSD(as.matrix(samples[,37:2574]))`. – David Arenburg May 03 '15 at 19:56
  • Yes, that makes sense, and I would absolutely do something like that if apply(df,1,sd) took longer than a few minutes to run, but the OP was asking about `apply` specifically. – Jthorpe May 03 '15 at 20:01
  • Thanks for the link to your post on apply... I agree that apply is just a for loop, but it doesn't do the element-wise assignment within a for loop; rather, there is just one assignment to the data.frame object. I assumed that was the reason that the `apply` was faster, not that the calculations themselves were actually faster. – Jthorpe May 03 '15 at 20:03
  • My motto is not to do what I'm told but what's right ;). `apply` with a margin of one should usually be avoided at all costs because R is a vectorized language and wasn't designed to work efficiently on rows. Not to mention that you have all the vectorized functions at your disposal. Lastly, `(samples[,37:2574] - theta.means)/theta.sds` is a great lesson for the OP on what vectorized means and why we don't need `for` loops (or `apply`s, which are the same). Cheers. – David Arenburg May 03 '15 at 20:06
  • Thanks for this. This clarified for me that "vectorized" really means "vectorized with respect to existing data structure", which in this case is a data.frame (a.k.a. list) of column vectors, and it's the existing (column) vectors to which apply(df,1, foo) is not vectorized. – Jthorpe May 03 '15 at 20:17
  • Vectorized basically means that it doesn't need to evaluate some function in each loop (doesn't matter if R or C). `rowMeans` is also vectorized because it calculates the means within the C code without calling `mean` from R in each loop unlike `apply` does. Though the definition of `vectorized` is somewhat arguable (it appears) as you can see by the various answers on that question. – David Arenburg May 03 '15 at 20:37
  • Just thinking out loud here, but it sounds like `vectorized` means a few things, including avoiding unnecessary function calls ( and their overhead ), and avoiding repeatedly allocating memory (even within the memory already allocated to R by the operating system), and probably something else, by taking advantage of R's functions that pre-allocate and/or re-use memory based on the 'shape' of the arguments. – Jthorpe May 03 '15 at 21:03
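For the side-by-side comparison asked about in the comments, a rough timing sketch along these lines could be used. The matrix size is an arbitrary assumption and the timings will vary by machine and R version, so no numbers are claimed here; only the equality of the two results is expected to hold.

```r
set.seed(3)
m <- matrix(rnorm(2000 * 500), nrow = 2000)   # hypothetical size

# apply() calls mean() once per row from R
t.apply <- system.time(r1 <- apply(m, 1, mean))
# rowMeans() does the whole computation in compiled code
t.rowmeans <- system.time(r2 <- rowMeans(m))

all.equal(r1, r2)    # the two approaches give the same values

# Elapsed seconds for each approach
c(apply = t.apply[["elapsed"]], rowMeans = t.rowmeans[["elapsed"]])
```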