Multiple for loop time computation very high in R

Question

I have data about machines in the following form. Number of rows: 900k.

Data

          A   B   C   D   E   F   G   H   I   J   K   L   M   N
         ---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
     1    1   1   1   1   1   1   1   1   1   0   1   1   0   0
     2    0   0   0   0   1   1   1   0   1   1   0   0   1   0
     3    0   0   0   0   0   0   0   1   1   1   1   1   0   0

1 indicates that the machine was active and 0 indicates that it was inactive.

I want my output to look like

          A   B   C   D   E   F   G   H   I   J   K   L   M   N
         ---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
     1    1   1   1   1   1   1   1   1   1   1   1   1   0   0
     2    0   0   0   0   1   1   1   1   1   1   0   0   1   0
     3    0   0   0   0   0   0   0   1   1   1   1   1   0   0

Basically, all I am trying to do is look for zeros in each row and, if a zero has a 1 on both sides, replace that 0 with 1.

Example:

In row 1 there is a zero in column J, but columns I and K both contain 1, so I replace that 0 with 1 because it is surrounded by 1s.

The code I am using is this

  for(i in 1:nrow(data)) {
    for(j in 2:13) {
      if(data[i,j]==0 && data[i,j-1]==1 && data[i,j+1]==1){
        data[i,j] = 1
      }
    }
  }

Is there a way to reduce the computation time for this? It takes almost 30 minutes to run in R. Any help would be appreciated.

Possible duplicate of [Speed up the loop operation in R](https://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r) — yusuzech, Oct 02 '19 at 20:12
hmm i get around 2s to run your code for a 900,000 by 14 matrix — chinsoon12, Oct 03 '19 at 00:44
@H1 `set.seed(0L); data <- matrix(sample(0:1, 900e3*14, TRUE), ncol=14); for (i in 1:900e3) { for(j in 2:13) { if(data[i,j]==0 && data[i,j-1]==1 && data[i,j+1]==1) { data[i,j] = 1 } } }` on R-3.6.1 Win7 x64 — chinsoon12, Oct 03 '19 at 02:03
@yifyan I saw that but the fact that I was using multiple for loops made me repost this! — Jeet, Oct 03 '19 at 13:30

Wietze314 · Answer 1 · 2019-10-03T11:19:43.897

3

This is faster because it does not need to iterate through the rows; each pass of the loop updates a whole column at once.

for(j in 2:13) {
  data[,j] = ifelse(data[,j-1] * data[,j+1]==1,1,data[,j])
  }

Or, a little more optimized, without using ifelse:

for(j in 2:(ncol(data) - 1)) {
  data[data[, j - 1] * data[, j + 1] == 1, j] <- 1
  }
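As a quick check, here is a minimal sketch that runs the second loop on the three sample rows from the question. It assumes the data is held as a numeric matrix (the question does not say whether `data` is a matrix or a data frame); the column names A to N are taken from the question.

```r
# The three sample rows from the question, as a numeric matrix (columns A..N).
# Assumption: the real data can be held as a matrix rather than a data frame.
data <- matrix(c(
  1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
  0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
  0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0
), nrow = 3, byrow = TRUE, dimnames = list(NULL, LETTERS[1:14]))

# One pass per interior column; each pass handles every row at once.
for (j in 2:(ncol(data) - 1)) {
  data[data[, j - 1] * data[, j + 1] == 1, j] <- 1
}

# Row 1, column J and row 2, column H are the two cells that get filled.
data[, c("H", "I", "J", "K")]
```

The loop still runs 12 times, but 12 vectorized column updates are far cheaper than 900k x 12 single-cell checks.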

edited Oct 03 '19 at 11:19

answered Oct 02 '19 at 20:17


score 2 · Answer 2 · answered Oct 02 '19 at 20:18

Cut the time by using vectorized operations. Since you do the same thing for every row, you can test the condition on whole columns at once with vectorized logical comparisons.

for(i in seq(ncol(data) - 2) + 1){ #<== all but the first and last column
    #Find all rows where both neighbouring columns are 1 and the center column is 0
    condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
    #Overwrite only the values that satisfy the condition
    data[which(condition), i] <- 1
}
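If you want to convince yourself that the column-wise version gives the same answer as the nested loop, a sketch like the following compares the two on random data. The seed, the 1000-row size and the variable names are arbitrary choices for illustration, not something from the question.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
orig <- matrix(sample(c(0, 1), 1000 * 14, replace = TRUE), ncol = 14)

# Reference: the cell-by-cell nested loop from the question.
by_cell <- orig
for (i in seq_len(nrow(by_cell))) {
  for (j in 2:(ncol(by_cell) - 1)) {
    if (by_cell[i, j] == 0 && by_cell[i, j - 1] == 1 && by_cell[i, j + 1] == 1) {
      by_cell[i, j] <- 1
    }
  }
}

# Column-wise version: one vectorized condition per interior column.
by_col <- orig
for (j in 2:(ncol(by_col) - 1)) {
  condition <- by_col[, j - 1] == 1 & by_col[, j + 1] == 1 & by_col[, j] == 0
  by_col[condition, j] <- 1
}

# Should be TRUE: both approaches fill exactly the same cells.
identical(by_cell, by_col)
```

The results agree because a cell is only filled when it is 0, and a 0 can never cause its left neighbour to be filled, so the order of the updates does not matter.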

Benjamin Ye · Accepted Answer · 2019-10-02T21:00:52.587

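You could also use gsub to replace any instances of 101 with 111 using the following code (as_tibble comes from the tidyverse, and numLetters is the number of columns). Note that strsplit returns characters, so the resulting columns are character rather than numeric:

numLetters <- ncol(data)
collapsed <- gsub('101', '111', apply(data, 1, paste, collapse = ''))
data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
names(data) <- LETTERS[1:numLetters]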
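One caveat with the plain '101' pattern: gsub matches do not overlap, so in a run like 10101 the middle 1 is consumed by the first match and the second gap is left unfilled, whereas the loop-based solutions fill both. A possible workaround, sketched here as a suggestion rather than part of the original answer, is a lookaround pattern with perl = TRUE that matches only the 0 itself:

```r
x <- "10101"

# Plain pattern: matches are non-overlapping, so the second gap is skipped.
plain <- gsub("101", "111", x)
plain   # "11101"

# Lookbehind/lookahead (requires perl = TRUE): match only a 0 that has a 1
# on each side, without consuming the neighbouring 1s.
fixed <- gsub("(?<=1)0(?=1)", "1", x, perl = TRUE)
fixed   # "11111"
```

Whether this matters depends on the data; on rows with alternating 1s and 0s the two patterns give different results.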

Here's a comparison of everyone's solutions:

library(data.table)
library(rbenchmark)
library(tidyverse)
set.seed(1)
numLetters <- 13
df <- as_tibble(matrix(round(runif(numLetters * 100)), ncol = numLetters))
names(df) <- LETTERS[1:numLetters]
benchmark(
  'gsub' = {
    data <- df
    collapsed <- gsub('101', '111', apply(data, 1, paste, collapse = ''))
    data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
    names(data) <- LETTERS[1:numLetters]
  },
  'for_orig' = {
    data <- df
    for(i in 1:nrow(data)) {
      for(j in 2:(ncol(data) - 1)) {
        if(data[i, j] == 0 && data[i, j - 1] == 1 && data[i, j + 1] == 1) {
          data[i, j] = 1
        }
      }
    }
  },
  'for_norows' = {
    data <- df
    for(j in 2:(ncol(data) - 1)) {
      data[, j] = ifelse(data[, j - 1] * data[, j + 1] == 1, 1, data[, j])
    }
  },
  'vectorize' = {
    data <- df
    for(i in seq(ncol(data) - 2) + 1) {
      condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
      data[which(condition), i] <- 1
    }
  },
  'index' = {
    data <- df
    idx <- apply(data, 1, function(x) c(0, diff(x)))
    data[which(idx == -1 & lead(idx == 1), arr.ind = TRUE)[, 2:1]] <- 1
  },
  replications = 100
)

The indexing solution (which has since been deleted) wins hands-down in terms of computational time for a data frame with 100 rows and 13 columns.

        test replications elapsed relative user.self sys.self user.child sys.child
3 for_norows          100    1.19    7.438      1.19        0         NA        NA
2   for_orig          100    9.29   58.063      9.27        0         NA        NA
1       gsub          100    0.28    1.750      0.28        0         NA        NA
5      index          100    0.16    1.000      0.16        0         NA        NA
4  vectorize          100    0.87    5.438      0.87        0         NA        NA

It might be worth benchmarking this on larger datasets, since not all methods scale well. — Wietze314, Oct 03 '19 at 11:09
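Following up on that comment, here is a rough sketch of how one might time two of the approaches on a larger matrix using only base R. The function names fill_by_column and fill_all_at_once, the seed and the 1e5 row count are all made up for illustration; raise n_rows towards 900e3 to match the question. No timings are shown because they depend on the machine.

```r
set.seed(1)
n_rows <- 1e5  # assumption: smaller than the question's 900k to keep this quick
big <- matrix(sample(c(0, 1), n_rows * 14, replace = TRUE), ncol = 14)

# Column-wise loop (hypothetical helper wrapping the loop-based answer).
fill_by_column <- function(m) {
  for (j in 2:(ncol(m) - 1)) {
    m[m[, j - 1] * m[, j + 1] == 1, j] <- 1
  }
  m
}

# Single indexing step over all interior columns (hypothetical helper).
fill_all_at_once <- function(m) {
  nc <- ncol(m)
  m[, 2:(nc - 1)][m[, 1:(nc - 2)] * m[, 3:nc] == 1] <- 1
  m
}

t_column <- system.time(res_column <- fill_by_column(big))
t_index  <- system.time(res_index  <- fill_all_at_once(big))

# Elapsed seconds for each approach, and a check that they agree.
c(by_column = t_column[["elapsed"]], all_at_once = t_index[["elapsed"]])
identical(res_column, res_index)
```

For more careful numbers you would repeat each call several times, as the rbenchmark comparison above does.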

score 2 · Answer 4 · answered Oct 02 '19 at 23:35

You can avoid loops altogether and use indexing to replace all the values at once:

  nc <- ncol(df)
  df[, 2:(nc - 1)][df[, 1:(nc - 2)] * df[, 3:nc] == 1] <- 1
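To see what the one-liner does, it can help to build the mask separately on a tiny example. This is only a sketch on a made-up 2-by-5 matrix `m`; the names left, right and mask are introduced here for illustration.

```r
# Tiny made-up example: each row has one 0 sitting between two 1s.
m <- matrix(c(1, 0, 1, 1, 0,
              0, 1, 0, 1, 0), nrow = 2, byrow = TRUE)
nc <- ncol(m)

left  <- m[, 1:(nc - 2)]   # left neighbour of every interior cell
right <- m[, 3:nc]         # right neighbour of every interior cell
mask  <- left * right == 1 # TRUE where both neighbours are 1
mask

# Set every interior cell whose neighbours are both 1 (cells that are
# already 1 are simply overwritten with 1 again).
m[, 2:(nc - 1)][mask] <- 1
m
```

Because the mask is computed from the original neighbours in one step, nothing depends on the order in which cells are updated.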

