2

I have data about machines in the following form, with about 900k rows.

Data

          A   B   C   D   E   F   G   H   I   J   K   L   M   N
         ---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
     1    1   1   1   1   1   1   1   1   1   0   1   1   0   0
     2    0   0   0   0   1   1   1   0   1   1   0   0   1   0
     3    0   0   0   0   0   0   0   1   1   1   1   1   0   0

1 indicates that the machine was active and 0 indicates that it was inactive.

I want my output to look like

          A   B   C   D   E   F   G   H   I   J   K   L   M   N
         ---- -- --- ---- --- --- --- --- --- --- --- --- --- ---
     1    1   1   1   1   1   1   1   1   1   1   1   1   0   0
     2    0   0   0   0   1   1   1   1   1   1   0   0   1   0
     3    0   0   0   0   0   0   0   1   1   1   1   1   0   0

Basically, all I am trying to do is look for zeros in each row and, if a zero has a 1 on both sides, replace that 0 with 1.

Example:

In row 1 there is a zero in column J, but columns I and K are both 1, so I replace that 0 with 1 because it is surrounded by 1s.

The code I am using is this

for(i in 1:nrow(data)) {
  # interior columns only: each needs a left and a right neighbour
  for(j in 2:13) {
    if(data[i,j]==0 && data[i,j-1]==1 && data[i,j+1]==1){
      data[i,j] = 1
    }
  }
}

Is there a way to reduce the computation time for this? It takes almost 30 minutes to run in R. Any help would be appreciated.

Jeet
  • @d.b Ohh yeah! Thanks. Updating the post. Appreciate it man – Jeet Oct 02 '19 at 19:48
  • Possible duplicate of [Speed up the loop operation in R](https://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r) – yusuzech Oct 02 '19 at 20:12
  • hmm i get around 2s to run your code for a 900,000 by 14 matrix – chinsoon12 Oct 03 '19 at 00:44
  • @H1 `set.seed(0L); data <- matrix(sample(0:1, 900e3*14, TRUE), ncol=14); for (i in 1:900e3) { for(j in 2:13) { if(data[i,j]==0 && data[i,j-1]==1 && data[i,j+1]==1) { data[i,j] = 1 } } }` on R-3.6.1 Win7 x64 – chinsoon12 Oct 03 '19 at 02:03
  • @yifyan I saw that but the fact that I was using multiple for loops made me repost this! – Jeet Oct 03 '19 at 13:30
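The roughly 2 s timing reported in the comments was measured on a matrix. If `data` is actually a data frame (an assumption; the question does not say), single-cell indexing is typically much slower, which could account for part of the gap. A minimal sketch of how one might check this on a smaller sample; `fill_loop`, `n`, `df`, and `m` are illustrative names, and no timings are claimed here:

```r
set.seed(0L)
n <- 2000  # deliberately small; scale up to taste
df <- as.data.frame(matrix(sample(0:1, n * 14, TRUE), ncol = 14))
m <- as.matrix(df)  # same values, stored as a matrix

# The question's double loop, wrapped so it can run on either container.
fill_loop <- function(data) {
  for (i in 1:nrow(data)) {
    for (j in 2:(ncol(data) - 1)) {
      if (data[i, j] == 0 && data[i, j - 1] == 1 && data[i, j + 1] == 1) {
        data[i, j] <- 1
      }
    }
  }
  data
}

print(system.time(res_df <- fill_loop(df)))  # data frame indexing
print(system.time(res_m <- fill_loop(m)))    # matrix indexing

# Both containers should end up with the same values.
all(as.matrix(res_df) == res_m)
```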

4 Answers

3

This is faster because it does not need to iterate over the rows; each pass updates a whole column at once.

for(j in 2:13) {
  data[,j] = ifelse(data[,j-1] * data[,j+1]==1,1,data[,j])
  }

Or, a little more optimized, without using ifelse:

for(j in 2:(ncol(data) - 1)) {
  data[data[, j - 1] * data[, j + 1] == 1, j] <- 1
  }
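As a quick check, here is a minimal sketch that applies the second loop to the three sample rows from the question. It assumes `data` is a numeric matrix, which the question does not state.

```r
# Sample rows from the question, stored as a numeric matrix (an assumption).
data <- matrix(c(
  1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0,
  0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0,
  0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0
), nrow = 3, byrow = TRUE, dimnames = list(NULL, LETTERS[1:14]))

# One vectorized pass per interior column instead of one test per cell.
for (j in 2:(ncol(data) - 1)) {
  data[data[, j - 1] * data[, j + 1] == 1, j] <- 1
}

data[1, "J"]  # was 0 between I = 1 and K = 1, now 1
data[2, "H"]  # was 0 between G = 1 and I = 1, now 1
```

Only those two cells change; every other zero has at least one zero neighbour or sits in the first or last column.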
Wietze314
2

Cut the time by using vectorized operations. Since the same test is applied to every row, it can be evaluated for a whole column at once with a vectorized logical condition.

for(i in seq(ncol(data) - 2) + 1){ # <== all but the first and last columns
    # Find rows where both neighbouring columns are 1 and the center column is 0
    condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
    # Overwrite only the values that satisfy the condition
    data[which(condition), i] <- 1
}
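One side effect of wrapping the condition in `which()` is that rows where the condition evaluates to `NA` are skipped, because `which()` returns only the positions that are `TRUE`. A small sketch under the assumption that `data` is a numeric matrix that may contain missing values (the question does not mention any); the condition is written in the equivalent shorter form "both neighbours equal 1":

```r
# Three toy rows: a fillable gap, a gap next to a missing value, and no gap.
data <- matrix(c(1, 0, 1,
                 1, 0, NA,
                 1, 0, 0), nrow = 3, byrow = TRUE)

for (i in seq(ncol(data) - 2) + 1) {
  # Row 2 yields NA here because its right-hand neighbour is missing.
  condition <- data[, i - 1] == 1 & data[, i + 1] == 1 & data[, i] == 0
  # which() keeps only the TRUE positions, so the NA row is left untouched.
  data[which(condition), i] <- 1
}

data
```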
Oliver
2

You could also use gsub to replace any instances of 101 with 111 using the following code:

library(tibble)  # for as_tibble()
numLetters <- ncol(data)
collapsed <- gsub('101', '111', apply(data, 1, paste, collapse = ''))
data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
names(data) <- LETTERS[1:numLetters]
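One caveat with the literal pattern: `gsub` matches do not overlap, so in a row such as 10101 only the first zero is replaced, whereas the loop in the question fills both. A possible workaround, sketched here with a hypothetical helper `fill_gaps_regex` and assuming a numeric matrix as input, is to match only the zero itself using Perl-style lookarounds:

```r
x <- "10101"
gsub("101", "111", x)                      # "11101": second gap is missed
gsub("(?<=1)0(?=1)", "1", x, perl = TRUE)  # "11111": only the 0 is consumed

# Hypothetical helper: collapse each row to a string, fill the gaps,
# then split back into a numeric matrix with the original dimnames.
fill_gaps_regex <- function(m) {
  rows <- apply(m, 1, paste, collapse = "")
  rows <- gsub("(?<=1)0(?=1)", "1", rows, perl = TRUE)
  out <- do.call(rbind, lapply(strsplit(rows, ""), as.numeric))
  dimnames(out) <- dimnames(m)
  out
}

m <- matrix(c(1, 0, 1, 0, 1,
              1, 0, 0, 1, 1), nrow = 2, byrow = TRUE)
fill_gaps_regex(m)
```

The first row comes back as all ones; the second is unchanged because neither of its zeros has a 1 on both sides.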

Here's a comparison of everyone's solutions:

library(data.table)
library(rbenchmark)
library(tidyverse)
set.seed(1)
numLetters <- 13
df <- as_tibble(matrix(round(runif(numLetters * 100)), ncol = numLetters))
names(df) <- LETTERS[1:numLetters]
benchmark(
  'gsub' = {
    data <- df
    collapsed <- gsub('101', '111', apply(data, 1, paste, collapse = ''))
    data <- as_tibble(t(matrix(unlist(sapply(collapsed, strsplit, split = '')), nrow = numLetters)))
    names(data) <- LETTERS[1:numLetters]
  },
  'for_orig' = {
    data <- df
    for(i in 1:nrow(data)) {
      for(j in 2:(ncol(data) - 1)) {
        if(data[i, j] == 0 && data[i, j - 1] == 1 && data[i, j + 1] == 1) {
          data[i, j] = 1
        }
      }
    }
  },
  'for_norows' = {
    data <- df
    for(j in 2:(ncol(data) - 1)) {
      data[, j] = ifelse(data[, j - 1] * data[, j + 1] == 1, 1, data[, j])
    }
  },
  'vectorize' = {
    data <- df
    for(i in seq(ncol(data) - 2) + 1) {
      condition <- data[, i - 1] == data[, i + 1] & data[, i - 1] == 1 & data[, i] == 0
      data[which(condition), i] <- 1
    }
  },
  'index' = {
    data <- df
    idx <- apply(data, 1, function(x) c(0, diff(x)))
    data[which(idx == -1 & lead(idx == 1), arr.ind = TRUE)[, 2:1]] <- 1
  },
  replications = 100
)

The indexing solution (which has since been deleted) wins hands-down in terms of computation time for a data frame with 100 rows and 13 columns.

        test replications elapsed relative user.self sys.self user.child sys.child
3 for_norows          100    1.19    7.438      1.19        0         NA        NA
2   for_orig          100    9.29   58.063      9.27        0         NA        NA
1       gsub          100    0.28    1.750      0.28        0         NA        NA
5      index          100    0.16    1.000      0.16        0         NA        NA
4  vectorize          100    0.87    5.438      0.87        0         NA        NA
Benjamin Ye
2

You can avoid loops altogether and use indexing to replace all the values at once:

  nc <- ncol(df)
  df[, 2:(nc - 1)][df[, 1:(nc - 2)] * df[, 3:nc] == 1] <- 1
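Updating every cell at once gives the same result as the sequential loop because a zero is only filled when both of its neighbours are already 1, so a fill can never create a new fillable zero next to it. A minimal sketch that wraps the one-liner in a function and checks it against the question's loop on random data; `fill_gaps` and `fill_gaps_loop` are illustrative names, and a matrix input is assumed:

```r
# Vectorized version: compare the left-shifted and right-shifted blocks.
fill_gaps <- function(m) {
  nc <- ncol(m)
  if (nc < 3) return(m)  # no interior columns to fill
  m[, 2:(nc - 1)][m[, 1:(nc - 2)] * m[, 3:nc] == 1] <- 1
  m
}

# Reference version: the cell-by-cell loop from the question.
fill_gaps_loop <- function(m) {
  for (i in 1:nrow(m)) {
    for (j in 2:(ncol(m) - 1)) {
      if (m[i, j] == 0 && m[i, j - 1] == 1 && m[i, j + 1] == 1) {
        m[i, j] <- 1
      }
    }
  }
  m
}

set.seed(42)
m <- matrix(sample(0:1, 1000 * 14, replace = TRUE), ncol = 14)
all(fill_gaps(m) == fill_gaps_loop(m))
```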
Ritchie Sacramento