
I'm new to R and I'm stuck on a problem I can't solve by myself.

A friend recommended that I use one of the apply functions, but I just don't get how to use it in this case. Anyway, on to the problem! =)

Inside the inner while loop, I have an ifelse. That is the bottleneck. It takes on average 1 second to run each iteration. The slow part is marked with #slow part start/end in the code.

Given that we run it 2000*100 = 200000 times, each run of this code takes approximately 55.5 hours to finish. The bigger problem is that this code will be reused a lot, so x*55.5 hours is just not doable.

Below is the part of the code relevant to the question:

    #dt is data.table with close to 1.5million observations of 11 variables
    #rand.mat is a 110*100 integer matrix

    j <- 1
    while(j <= 2000)
    {  
            #other code is executed here, not relevant to the question

            i <- 1
            while(i <= 100)
            {
                    #slow part start
                    dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
                    #slow part end

                    i <- i + 1
            }

            #other code is executed here, not relevant to the question

            j <- j + 1
    }

Please, any advice would be greatly appreciated.

EDIT - Run the code below to reproduce the problem:

    library(data.table)

    dt = data.table(
        datecolumn = c("20121101", "20121101", "20121104", "20121104", "20121130", "20121130",
                       "20121101", "20121101", "20121104", "20121104", "20121130", "20121130"),
        column2 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
        column3 = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
        column4 = c("1", "1", "1", "1", "1", "1", "2", "2", "2", "2", "2", "2"))

    unq_date <- c(20121101L,
        20121102L, 20121103L, 20121104L, 20121105L, 20121106L, 20121107L,
        20121108L, 20121109L, 20121110L, 20121111L, 20121112L, 20121113L,
        20121114L, 20121115L, 20121116L, 20121117L, 20121118L, 20121119L,
        20121120L, 20121121L, 20121122L, 20121123L, 20121124L, 20121125L,
        20121126L, 20121127L, 20121128L, 20121129L, 20121130L
    )

    index <- as.numeric(dt$column4)
    numberOfRepititions <- 2
    set.seed(131107)

    rand.mat <- replicate(numberOfRepititions, sample(unq_date, numberOfRepititions))
    i <- 1
    while(i <= numberOfRepititions)
    {
        dt$column2 = ifelse(dt$datecolumn %in% c(rand.mat[,i]) & dt$column4==index[i], NA, dt$column2)
        i <- i + 1
    }

Notice that we won't be able to run the loop more than 2 times now unless dt grows in rows so that we have the initial 100 types of column4 (which is just an integer value 1-100).

Armen Abrami
  • Loops in R, in general, are too slow. You should avoid them. The `compiler` package can improve the speed of loops with `enableJIT(3)`. – user1436187 Nov 17 '13 at 09:24
  • @user1436187 I timed the loop without the `dt$column3 = ifelse(dt$column4 %in% c(rand.mat[,i]) & dt$column2==index[i], NA, dt$column3)` line and it runs really fast. So I'm still pretty sure it's that line and not the loop itself. I'll still look into your advice on the compiler package. – Armen Abrami Nov 17 '13 at 09:26
  • You should vectorize it or write it in c or fortran. – user1436187 Nov 17 '13 at 09:30
  • Welcome on SO! Could you give us a reproducible example (we need a small version of `rand.mat` and `dt`). Please read http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – sgibb Nov 17 '13 at 09:38
  • @user1436187 Stupid question maybe, but how do I vectorize it? Also, the JIT compilation made no noticeable improvement in speed. :/ – Armen Abrami Nov 17 '13 at 09:49
  • @sgibb Thank you! I'll see what I can do. I need a couple of minutes though. – Armen Abrami Nov 17 '13 at 09:50
  • @sgibb Added some data so the example is reproducible (to a small extent). Notice that we won't be able to run the loop more than 2 times now unless dt grows in rows so that we have the initial 100 types of column4 (which is just an integer value 1-100). – Armen Abrami Nov 17 '13 at 10:42
  • I suppose `index` does not contain the same information as `dt$column4` in your actual data. Right? – Sven Hohenstein Nov 17 '13 at 12:11
  • @SvenHohenstein `index <- as.numeric(dt$column4)`. Note that the values of column4 are not actually 1-100 in reality; they could be any integer value. – Armen Abrami Nov 17 '13 at 12:18
  • Just for clarification: Do you really want to replace with `NA` and possibly later replace with the old value again? – Sven Hohenstein Nov 17 '13 at 12:35
  • When I replace with NA, it is only done in memory. The actual data table will keep its original values intact. – Armen Abrami Nov 17 '13 at 13:09

2 Answers


Here is one proposal which is based on your small example dataset. I tried to vectorize the operations. Like in your example, numberOfRepititions represents the number of loop runs.

First, create matrices for all necessary evaluations. dt$datecolumn is compared with all columns of rand.mat:

    rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)

Here, dt$column4 is compared with all values of the vector index:

    imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)

Both matrices are combined with logical and. Afterwards, we calculate whether there is at least one TRUE:

    replace_idx <- rowSums(rmat & imat) != 0

Use the created index to replace corresponding values with NA:

    is.na(dt$column2) <- replace_idx

Done.


The code in one chunk:

    rmat <- apply(rand.mat[, seq(numberOfRepititions)], 2, "%in%", x = dt$datecolumn)
    imat <- sapply(head(index, numberOfRepititions), "==", dt$column4)
    replace_idx <- rowSums(rmat & imat) != 0
    is.na(dt$column2) <- replace_idx

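
As a sanity check, the following minimal sketch compares the loop from the question with the vectorized version on a tiny dataset. It is only an illustration under some assumptions: a plain data.frame called `dat` stands in for the data.table so that no extra package is needed, `n_rep` stands in for numberOfRepititions, and `rand.mat` is filled with hand-picked dates (not the sampled ones) so that a few rows actually match.

```r
# Stand-in for the question's data (assumption: plain data.frame, not a data.table).
dat <- data.frame(
  datecolumn = rep(c("20121101", "20121101", "20121104",
                     "20121104", "20121130", "20121130"), 2),
  column2    = c("5", "3", "4", "6", "8", "9", "2", "4", "3", "5", "6", "8"),
  column4    = rep(c("1", "2"), each = 6),
  stringsAsFactors = FALSE
)
index <- as.numeric(dat$column4)
n_rep <- 2  # hypothetical name, plays the role of numberOfRepititions

# Hand-picked stand-in for rand.mat (hypothetical values, one column per iteration).
rand.mat <- cbind(c(20121101L, 20121115L), c(20121104L, 20121120L))

# Loop version, as in the question, but writing to a copy of column2.
loop_res <- dat$column2
for (i in seq_len(n_rep)) {
  loop_res <- ifelse(dat$datecolumn %in% rand.mat[, i] & dat$column4 == index[i],
                     NA, loop_res)
}

# Vectorized version: one logical column per iteration, then combine by row.
rmat <- apply(rand.mat[, seq(n_rep)], 2, "%in%", x = dat$datecolumn)
imat <- sapply(head(index, n_rep), "==", dat$column4)
replace_idx <- rowSums(rmat & imat) != 0

vec_res <- dat$column2
is.na(vec_res) <- replace_idx

# Both approaches should blank out exactly the same rows (rows 1 to 4 here).
stopifnot(identical(loop_res, vec_res))
```

On the real data, the gain should come from calling `%in%` and `==` once per column inside `apply`/`sapply` and doing a single replacement at the end, instead of rebuilding the whole column with `ifelse` in every iteration. It is worth timing on a realistic subset before relying on it.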
Sven Hohenstein
  • Interesting. I'll try a simulation of this to see if it's faster. Hopefully it is, and I'll mark your answer as correct. =) – Armen Abrami Nov 17 '13 at 12:38

I think you can do it in 1 line like this:

    dt[which(apply(dt, 1, function(x) x[1] %in% rand.mat[,as.numeric(x[4])])),]$column3<-NA

Basically, the apply function works as follows, argument by argument:

1) uses the data from "dt"

2) "1" means apply by row

3) the function passes the row as 'x', returns TRUE if your criteria are met
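
To make the three arguments concrete, here is a small sketch of row-wise `apply` on made-up data. The names `dat`, `lookup` and `hits` are hypothetical and only serve the illustration; `lookup` plays the role of rand.mat, with column k holding the dates to match for group k. Note that `apply` first converts the table to a matrix, so with character columns each row arrives in the function as a character vector, hence the `as.numeric()`.

```r
# Made-up data: a date and a group id per row (both stored as character).
dat <- data.frame(
  datecolumn = c("20121101", "20121104", "20121130"),
  column4    = c("1", "2", "1"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup matrix: column k lists the dates to match for group k.
lookup <- cbind(c("20121101", "20121102"),
                c("20121130", "20121105"))

# 1) data = dat, 2) MARGIN = 1 means "by row", 3) the function receives one row as x.
hits <- apply(dat, 1, function(x) x[1] %in% lookup[, as.numeric(x[2])])

# Only the first row has a date listed in the column for its own group.
which(hits)
```

Because the function is called once per row, this form can still be slow on a table with more than a million rows, so it may be worth timing it against a fully vectorized alternative.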

Troy