5

I'm analyzing large sets of data using the following script:

M <- c_alignment 
c_check <- function(x){
    if (x == c_1) {
        1
    }else{
        0
    }
}
both_c_check <- function(x){
    if (x[res_1] == c_1 && x[res_2] == c_1) {
        1
    }else{
        0
    }
}
variance_function <- function(x,y){
    sqrt(x*(1-x))*sqrt(y*(1-y))
}
frames_total <- nrow(M)
cols <- ncol(M)
c_vector <- apply(M, 2, max)
freq_vector <- matrix(nrow = sum(c_vector))
co_freq_matrix <- matrix(nrow = sum(c_vector), ncol = sum(c_vector))
insertion <- 0
res_1_insertion <- 0
for (res_1 in 1:cols){
    for (c_1 in 1:conf_vector[res_1]){
        res_1_insertion <- res_1_insertion + 1
        insertion <- insertion + 1
        res_1_subset <- sapply(M[,res_1], c_check)
        freq_vector[insertion] <- sum(res_1_subset)/frames_total
        res_2_insertion <- 0
        for (res_2 in 1:cols){
            if (is.na(co_freq_matrix[res_1_insertion, res_2_insertion + 1])){
                for (c_2 in 1:max(c_vector[res_2])){
                    res_2_insertion <- res_2_insertion + 1
                    both_res_subset <- apply(M, 1, both_c_check)
                    co_freq_matrix[res_1_insertion, res_2_insertion] <- sum(both_res_subset)/frames_total
                    co_freq_matrix[res_2_insertion, res_1_insertion] <- sum(both_res_subset)/frames_total
                }
            }
        }
    }
}
covariance_matrix <- (co_freq_matrix - crossprod(t(freq_vector)))
variance_matrix <- matrix(outer(freq_vector, freq_vector, variance_function), ncol = length(freq_vector))
correlation_coefficient_matrix <- covariance_matrix/variance_matrix

A model input would be something like this:

1 2 1 4 3
1 3 4 2 1
2 3 3 3 1
1 1 2 1 2
2 3 4 4 2

What I'm calculating is the binomial covariance for each state found in M[,i] with each state found in M[,j]. Each row is the state found for that trial, and I want to see how the state of the columns co-vary.

Clarification: I'm finding the covariance of two multinomial distributions, but I'm doing it through binomial comparisons.

The input is a 4200 x 510 matrix, and the c value for each column is about 15 on average. I know for loops are terribly slow in R, but I'm not sure how I can use the apply function here. If anyone has a suggestion as to properly using apply here, I'd really appreciate it. Right now the script takes several hours. Thanks!

M--
  • 25,431
  • 8
  • 61
  • 93
Michael LeVine
  • 197
  • 1
  • 9

2 Answers2

15

I thought of writing a comment, but I have too much to say.

First of all, if you think apply goes faster, look at Is R's apply family more than syntactic sugar? . It might be, but it's far from guaranteed.

Next, please don't grow matrices as you move through your code, that slows down your code incredibly. preallocate the matrix and fill it up, that can increase your code speed more than a tenfold. You're growing different vectors and matrices through your code, that's insane (forgive me the strong speech)

Then, look at the help page of ?subset and the warning given there:

This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.

Always. Use. Indices.

Further, You recalculate the same values over and over again. fre_res_2 for example is calculated for every res_2 and state_2 as many times as you have combinations of res_1 and state_1. That's just a waste of resources. Get out of your loops what you don't need to recalculate, and save it in matrices you can just access again.

Heck, now I'm at it: Please use vectorized functions. Think again and see what you can drag out of the loops : This is what I see as the core of your calculation:

cov <- (freq_both - (freq_res_1)*(freq_res_2)) /
(sqrt(freq_res_1*(1-freq_res_1))*sqrt(freq_res_2*(1-freq_res_2))) 

As I see it, you can construct a matrix freq_both, freq_res_1 and freq_res_2 and use them as input for that one line. And that will be the whole covariance matrix (don't call it cov, cov is a function). Exit loops. Enter fast code.

Given the fact I have no clue what's in c_alignment, I'm not going to rewrite your code for you, but you definitely should get rid of the C way of thinking and start thinking R.

Let this be a start: The R Inferno

Community
  • 1
  • 1
Joris Meys
  • 106,551
  • 31
  • 221
  • 263
  • 1
    I wish i could give you +2! great answer and links! – Justin Feb 16 '12 at 23:13
  • In particular note Circle 2 of 'The R Inferno'. – Patrick Burns Feb 17 '12 at 09:26
  • @PatrickBurns Honored to receive a comment from you. Let me thank you again for the great work you did on the R inferno. It helped me a lot in the past and it is still a source of inspiration for many of my students. Thanks! – Joris Meys Feb 17 '12 at 12:24
  • Thanks for all the help. I definitely tend to think more like C than R. Will work on making the changes today! Thanks again. – Michael LeVine Feb 17 '12 at 16:56
  • Updated my script. Could you check it out above and let me know if you think there is anything else I can do? – Michael LeVine Feb 22 '12 at 05:46
  • @MichaelLeVine Take a look at [How to make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to provide a sample dataset that we don't have to construct ourselves. I don't have the time to dissect your code. Also, you might want to ask this in a follow-up question, as you might confuse future users by changing the original question to which the answers were given. – Joris Meys Feb 22 '12 at 15:27
3

It's not really the 4 way nested loops but the way your code is growing memory on each iteration. That's happening 4 times where I've placed # ** on the cbind and rbind lines. Standard advice in R (and Matlab and Python) in situations like this is to allocate in advance and then fill it in. That's what the apply functions do. They allocate a list as long as the known number of results, assign each result to each slot, and then merge all the results together at the end. In your case you could just allocate the correct size matrix in advance and assign into it at those 4 points (roughly speaking). That should be as fast as the apply family, and you might find it easier to code.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224