
I have a for loop inside a function, and it runs fine for data frames with fewer than 10,000 rows, but the time the loop takes increases exponentially as the number of rows increases. I have read this post about optimizing loops, but I don't know how to apply it to my situation.

Here is the for loop below:

for (i in 1:nrow(data.frame)) {
    event <- as.character(data.frame[i,"Event"])
    if(i < 20) {
        # do nothing
    }
    else {
        # Get the previous 20 rows
        one.sec.interval <- data.frame[(i - (20 - 1)):i, ]
        #       print(head(one.sec.interval))

        # get the covariance matrix
        cov.matrix <- var(one.sec.interval)

        # get the variance of the features
        variance.of.features <- diag(cov.matrix)

        # reformat the variance vector into data frame for easier manipulation
        variance.of.features <- matrix(variance.of.features,1,length(variance.of.features))
        variance.of.features <- data.frame(variance.of.features)

        # rename the variance column of the features
        colnames(variance.of.features) <- c('Back.Pelvis.Acc.X.sd', 'Back.Pelvis.Acc.Y.sd', 'Back.Pelvis.Acc.Z.sd',
        'Back.Pelvis.Gyro.X.sd', 'Back.Pelvis.Gyro.Y.sd', 'Back.Pelvis.Gyro.Z.sd',
        'Back.Trunk.Acc.X.sd', 'Back.Trunk.Acc.Y.sd', 'Back.Trunk.Acc.Z.sd',
        'Back.Trunk.Gyro.X.sd', 'Back.Trunk.Gyro.Y.sd', 'Back.Trunk.Gyro.Z.sd')

        # create the new feature vector
        new.feature.vector <- cbind(data.frame[i, ], variance.of.features)
        new.feature.vector$Event <- event
        one.sec.interval.data[i - (20 - 1), ] <- new.feature.vector
    }
}
    It won't do much in terms of performance but for readability you can start by changing the iteration sequence to `21:nrow(data.frame)`. This will allow you to remove the `if` statement (since your loop is not doing anything when `i < 20`). – seasmith Oct 31 '16 at 15:31
  • 1
    I would also suggest using matrices instead of data frames. They are much faster when you subset rows. – Andrey Shabalin Oct 31 '16 at 15:53
  • Thanks I will try that – YellowPillow Oct 31 '16 at 15:57
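
To make those comments concrete, here is a minimal sketch of what they suggest: put the numeric sensor columns in a matrix and start the loop at row 20 (the first row with a full 20-row window) so the if is no longer needed. The sensor.cols names are assumptions based on the column names in the question; replace them with whichever numeric columns you actually use.

# assumed column names; add the remaining sensor columns
sensor.cols <- c('Back.Pelvis.Acc.X', 'Back.Pelvis.Acc.Y', 'Back.Pelvis.Acc.Z')

# matrix row subsetting is much cheaper than data frame subsetting
sensor.matrix <- as.matrix(data.frame[, sensor.cols])

for (i in 20:nrow(sensor.matrix)) {
    one.sec.interval <- sensor.matrix[(i - 19):i, ]       # previous 20 rows, no if() needed
    variance.of.features <- diag(var(one.sec.interval))   # same variances as before
    # ... assemble the feature row as in the original loop ...
}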

1 Answer


If you want to use matrices, that could work too. Alternatively:

step 1: definitely use the data.table package; it's supremely fast.

step 2: set Event to character prior to the loop.

step 3: avoid if statements inside the loop if possible. In this case, just have i go from 20 to the last row instead of checking whether it's lower than 20:

library(data.table)
data.table$Event <- as.character(data.table$Event)
for (i in 20:nrow(data.table)) {
   ...do stuff...
}

step 4: you can set up the data.table columns ahead of time and rename them at the end.

data.table <- data.table("Col Name 1" = character(), "Col Name 2" = numeric())  # ...and so on for each column
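
For the rename-at-the-end part, a minimal sketch using data.table's setnames(), which renames columns in place; the old/new names here are just placeholders matching the example above and the question:

# rename the placeholder columns once the loop is done (names are illustrative)
setnames(data.table,
         old = c("Col Name 1", "Col Name 2"),
         new = c("Back.Pelvis.Acc.X.sd", "Back.Pelvis.Acc.Y.sd"))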

step 5: this may not be possible depending on how your data is structured, but you can also parallelize the work with the doMC package and a foreach loop. This requires, however, that each run does not depend on the data from any other run. Not sure if that's applicable to you or not.
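
A minimal sketch of that pattern, assuming four cores and a numeric matrix of the sensor columns (both assumptions); each 20-row window is computed independently and rbind stacks the results:

library(doMC)
library(foreach)

registerDoMC(cores = 4)   # assumed core count; adjust for your machine

# assumes sensor.matrix holds only the numeric sensor columns
variance.rows <- foreach(i = 20:nrow(sensor.matrix), .combine = rbind) %dopar% {
    diag(var(sensor.matrix[(i - 19):i, ]))   # variances for one 20-row window
}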

--Hope this helps!

  • Well you can use as.data.table() instead. But using data.table by itself isn't necessarily faster; in particular, it allows you to subset very quickly. ex: setkey(df, "name of the column you want to subset by") new.subset <- df[J("value to subset by"), nomatch=0L] – mjfred Oct 31 '16 at 16:53
  • Again it depends on your data but you can parallelize it: a <- foreach – mjfred Oct 31 '16 at 16:54