-2

I have the following code, which reads through the table's column and if the element contains the correct string, it increments a corresponding value in another vector. Here is the code:

dateArray <- integer(365)

for (i in 189500:207097) {
    if (grepl("Jan", csvaryana[i, "Date"], ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)) {
        for (j in 1:31) {
            if (j < 10) {
                if (grepl(paste(sprintf(" 0%d", j), ""), csvaryana[i, "Date"], ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE))
                    dateArray[j] <- dateArray[j] + 1
                }
            if (grepl(paste(sprintf(" %d", j), ""), csvaryana[i, "Date"], ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE))
                dateArray[j] <- dateArray[j] + 1
        }
    }
}

dateArray

Note that csvaryana is a table with 207,097 rows. The code is supposed to check all rows, but I cut this down to only about 10,000 rows. It takes a few minutes to run this, and much longer for the full code. How can I do this same thing much more quickly? I have heard that for loops are not very efficient.

Hayk Khulyan
  • Please show a small example of your data; we need to see how `Date` is formatted. This can probably be done much, much faster. – Marius Feb 20 '18 at 00:24
  • It's very likely it can be made vastly faster but unless we understand the data, we will not be able to give the best advice. – IRTFM Feb 20 '18 at 01:18
  • `dateArray[as.numeric(format(strptime(grep("Jan",csvaryana[, "Date"],T,value = T),"%b %d"),"%d"))]=1` or even `dataArray[as.numeric(gsub("\\D",grep("Jan",csvaryana$Date,T,value = T)))]=1` – Onyambu Feb 20 '18 at 01:52
  • `dateArray[as.numeric(gsub("\\D","",grep("Jan",csvaryana$Date,T,value = T)))]=1` – Onyambu Feb 20 '18 at 02:11
  • I guess there should be no for loop here. You can try the above. – Onyambu Feb 20 '18 at 02:20

4 Answers

1

for loops in R can be very slow, as other answers explained.
You can read this article if you want to speed up your loop: Strategies to Speed-up R Code

According to the article, you can do the following steps:

  1. Use ifelse() instead of if
  2. Take statements that check for conditions (if statements) outside the loop
  3. Run the loop only for TRUE conditions
  4. Use which() to select the rows that meet a condition
  5. Use apply family of functions instead of for-loops
  6. Use parallel processing if you have a multicore machine
  7. Use data structures that consume less memory
  8. If you know C++, then the best way is to use Rcpp, which runs C++ code.
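As a rough illustration of steps 2 to 4 applied to the question's code, here is a minimal sketch. The toy data frame and its date layout ("Mon Jan 01 08:15:00 2018") are assumptions, since the question never shows the real `Date` format, and `jan_rows` and `day_patterns` are hypothetical names.

```r
# Toy stand-in for csvaryana; the date layout is an assumption
csvaryana <- data.frame(
  Date = c("Mon Jan 01 08:15:00 2018", "Mon Jan 01 09:30:00 2018",
           "Wed Jan 03 10:00:00 2018", "Thu Feb 01 11:45:00 2018"),
  stringsAsFactors = FALSE
)

dateArray <- integer(365)

# The month check moves out of the loop: grepl() is vectorised,
# so one call tests every row and which() keeps the matching indices
jan_rows <- which(grepl("Jan", csvaryana$Date, fixed = TRUE))

# Build the 31 day patterns once (" 01 ", ..., " 31 ") instead of once per row
day_patterns <- sprintf(" %02d ", 1:31)

# Loop over 31 days rather than over every row
for (j in 1:31) {
  dateArray[j] <- sum(grepl(day_patterns[j], csvaryana$Date[jan_rows], fixed = TRUE))
}

dateArray[1:3]
```

The remaining loop runs 31 times regardless of the number of rows, so the per-row work is left to the vectorised grepl() calls.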
yusuzech
1

This is hard to do without a running example, but you could start by making each element of your loop into a function. Let's number the lines as follows:

#1  for (i in 189500:207097) {
#2      if (grepl("Jan", csvaryana[i, "Date"], ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)) {
#3          for (j in 1:31) {
#4              if (j < 10) {
#5                  if (grepl(paste(sprintf(" 0%d", j), ""), csvaryana[i, "Date"], ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE))
#6                      dateArray[j] <- dateArray[j] + 1
#7                  }
#8              if (grepl(paste(sprintf(" %d", j), ""), csvaryana[i, "Date"], ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE))
#9                  dateArray[j] <- dateArray[j] + 1
#10         }
#11     }
#12 }

You can then wrap up your grepl as a function (this is probably more aesthetic than time saving):

## The grepl function (lines 2, 5 and 8)
grepl.ifelse <- function(pattern, i, data) {
    grepl(pattern, data[i, "Date"], ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)
}

For the other parts of the loop, we can use sapply functions that pass the values of a vector to a function. Since we need simply to update dateArray we can use the <<- assignment that assigns values out of the function environment (see ?"<<-" for more info):

## Update dateArray function (lines 4 to 10)
update.dateArray <- function(j, i, csvaryana) {
    ## dateArray is deliberately not an argument: a local copy would shadow the
    ## global vector and every count would restart from the value passed in
    if (j < 10) {
        if (grepl.ifelse(pattern = paste(sprintf(" 0%d", j), ""), i = i, data = csvaryana)) {
            dateArray[j] <<- dateArray[j] + 1
        }
    } else {
        if (grepl.ifelse(pattern = paste(sprintf(" %d", j), ""), i = i, data = csvaryana)) {
            dateArray[j] <<- dateArray[j] + 1
        }
    }
}

This function will thus update dateArray outside the function (no need to return). We can apply the same principle to the bigger loop (i):

## Checking the month of January (lines 2 to 11)
check.jan <- function(i, csvaryana) {
    if (grepl.ifelse(pattern = "Jan", i = i, data = csvaryana)) {
        ## Update dateArray out of the function; the sapply() result is discarded
        ## because assigning it back would overwrite the counts
        sapply(1:31, update.dateArray, i, csvaryana)
    }
    invisible(NULL)
}

Again, it's hard to test without a running example, so this post might need some edits, but this is what it could look like:

dateArray <- integer(365)

## Running the whole loop
invisible(sapply(189500:207097, check.jan, csvaryana))
## Updated dateArray
dateArray
Thomas Guillerme
  • The original code just seems to count instances of each date in January - so the fastest and simplest solution is probably to convert the date column into a proper `Date` datatype and use the existing tools for dates (maybe with `lubridate`). Doing it manually like this is over-complicating things. – Marius Feb 20 '18 at 00:52
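A minimal sketch of the "proper `Date` datatype" idea from the comment above, using base R only. The date layout (weekday, month, day, time, year) is an assumption, as the real format is never shown, and `parts`, `parsed` and `doy` are hypothetical names.

```r
# Toy dates; the layout "weekday month day time year" is an assumption
dates <- c("Mon Jan 01 08:15:00 2018", "Mon Jan 01 09:30:00 2018",
           "Wed Jan 03 10:00:00 2018", "Thu Feb 01 11:45:00 2018")

# Split every string on spaces and stack the pieces into a character matrix
parts <- do.call(rbind, strsplit(dates, " +"))
month <- match(parts[, 2], month.abb)  # month.abb holds the English abbreviations
day   <- as.integer(parts[, 3])
year  <- as.integer(parts[, 5])

# Rebuild ISO dates so as.Date() needs no locale-dependent month parsing
parsed <- as.Date(sprintf("%04d-%02d-%02d", year, month, day))

# Day of the year for each row, then one count per day
doy <- as.integer(format(parsed, "%j"))
dateArray <- tabulate(doy, nbins = 365)

dateArray[c(1, 3, 32)]
```

This fills all 365 slots in one pass instead of handling January alone. Note that tabulate() with nbins = 365 silently drops day 366 of a leap year.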
0

I solved it without a loop. I separated the Date column into Year, Month and Day, and Time, then called count() on the Month and Day column, which returned the frequency of each Month and Day value:

## The quoted column name suggests plyr::count() (an assumption); X2 holds Month and Day
dateVector <- count(outfile, "X2")
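For readers without the package that provides count(), a comparable frequency count can be sketched in base R with table(). The toy `outfile` below and its column contents are assumptions; only the column name X2 comes from the answer.

```r
# Toy version of the split table; the values are invented for illustration
outfile <- data.frame(
  X1 = c(2018, 2018, 2018, 2018),
  X2 = c("Jan 01", "Jan 01", "Jan 03", "Feb 01"),
  stringsAsFactors = FALSE
)

# table() counts how often each distinct Month-and-Day value occurs
dateVector <- table(outfile$X2)

# Look up a single day by name
jan01 <- dateVector[["Jan 01"]]

# Or convert to a data frame with one row per distinct value
dateCounts <- as.data.frame(dateVector, stringsAsFactors = FALSE)
names(dateCounts) <- c("X2", "freq")
```

table() sorts the values alphabetically, so the result is not in calendar order unless the column is converted to a date or an ordered factor first.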
Hayk Khulyan
-1

Don't use for loops; that's how. There is a very good SO post discussing this: Why are loops slow in R?

The answer is to vectorise the operations on your data frame to speed up the process (via the apply family, or check out purrr). Clean your data and code so that sprintf(" 0%d", j) is not recomputed inside the loop, and consider a replacement for grepl, as it does seem overkill in this case.
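One possible replacement for the 31 grepl() calls per row is to extract the day number directly and count it. This is only a sketch: the date layout is an assumption, and `jan` and `day` are hypothetical names.

```r
# Toy dates; the layout "weekday month day time year" is an assumption
dates <- c("Mon Jan 01 08:15:00 2018", "Mon Jan 01 09:30:00 2018",
           "Wed Jan 03 10:00:00 2018", "Thu Feb 01 11:45:00 2018")

# Keep only the January rows (a single vectorised call)
jan <- grep("Jan", dates, fixed = TRUE, value = TRUE)

# Pull out the two digits that follow "Jan " instead of testing 31 patterns
day <- as.integer(sub(".*Jan ([0-9]{2}) .*", "\\1", jan))

# Count occurrences of each day of the month
dateArray <- integer(365)
dateArray[1:31] <- tabulate(day, nbins = 31)

dateArray[1:3]
```

No pattern is built inside a loop here, and each row is scanned once rather than up to 31 times.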

A good blog post discussing some of these concepts: https://robinsones.github.io/Making-R-Code-Faster-A-Case-Study/

Amar