I have written a for loop in my code:

for (i in 2:nrow(ProductionWellYear2)) {
  if (ProductionWellYear2[i, ncol(ProductionWellYear2)] == 0) {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i - 1, ncol(ProductionWellYear2)] + 1
  } else {
    ProductionWellYear2[i, ncol(ProductionWellYear2)] = ProductionWellYear2[i, ncol(ProductionWellYear2)]
  }
}

However, this is very slow, as the data frame has over 800k rows. How can I make this quicker and avoid the for loop?

Tim Batten
  • Why are we starting at row 2? Are your headers included as an actual data row? – Gaffi Feb 19 '20 at 18:14
  • Row 1 is not a header but has previously been computed. So the loop is not necessary on the first row. – Tim Batten Feb 19 '20 at 18:18
  • Why are you after an apply solution? Apply may not work best for solutions where referencing a previous element is a must. – Roman Luštrik Feb 19 '20 at 18:19
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. – MrFlick Feb 19 '20 at 18:31

2 Answers

You could use conditional assignment, taking advantage of the fact that R is a vectorized language.

Consider this initial data frame:

          X1          X2         X3 year
1  1.3709584 -0.09465904 -0.1333213 2014
2 -0.5646982  2.01842371  0.6359504    0
3  0.3631284 -0.06271410 -0.2842529 2016
4  0.6328626  1.30486965 -2.6564554    0
5  0.4042683  2.28664539 -2.4404669 2018
6 -0.1061245 -1.38886070  1.3201133    0
7  1.5115220 -0.27878877 -0.3066386 2020

Then do:

num.col <- ncol(ProductionWellYear2)  # to keep code short

ProductionWellYear2[ProductionWellYear2[num.col] == 0, num.col] <- 
  ProductionWellYear2[which(ProductionWellYear2[num.col] == 0) - 1, num.col] + 1

Resulting data frame:

          X1          X2         X3 year
1  1.3709584 -0.09465904 -0.1333213 2014
2 -0.5646982  2.01842371  0.6359504 2015
3  0.3631284 -0.06271410 -0.2842529 2016
4  0.6328626  1.30486965 -2.6564554 2017
5  0.4042683  2.28664539 -2.4404669 2018
6 -0.1061245 -1.38886070  1.3201133 2019
7  1.5115220 -0.27878877 -0.3066386 2020

Data:

ProductionWellYear2 <- structure(list(X1 = c(1.37095844714667, -0.564698171396089, 0.363128411337339, 
0.63286260496104, 0.404268323140999, -0.106124516091484, 1.51152199743894
), X2 = c(-0.0946590384130976, 2.01842371387704, -0.062714099052421, 
1.30486965422349, 2.28664539270111, -1.38886070111234, -0.278788766817371
), X3 = c(-0.133321336393658, 0.635950398070074, -0.284252921416072, 
-2.65645542090478, -2.44046692857552, 1.32011334573019, -0.306638594078475
), year = c(2014, 0, 2016, 0, 2018, 0, 2020)), row.names = c(NA, 
-7L), class = "data.frame")
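
One caveat: the assignment above reads the original value of the preceding row, so it matches the loop only when zeros never occur in consecutive rows. Below is a minimal sketch that also handles runs of zeros, assuming the year column contains no `NA` values. The helper name `fill_zero_runs` is made up for illustration.

```r
# Hypothetical helper: replace each 0 with
# (last preceding non-zero value) + (distance to that value).
# Like the original loop, element 1 is never changed and always acts as an anchor.
# Assumes y is numeric and contains no NA values.
fill_zero_runs <- function(y) {
  n <- length(y)
  if (n == 0) return(y)
  pos <- seq_len(n)
  anchor <- y != 0          # positions holding a known value
  anchor[1] <- TRUE         # row 1 is treated as already computed
  # index of the most recent anchor at or before each position
  last.anchor <- cummax(ifelse(anchor, pos, 0L))
  y[last.anchor] + (pos - last.anchor)
}

years <- c(2014, 0, 0, 2017, 0)
fill_zero_runs(years)
# [1] 2014 2015 2016 2017 2018
```

To apply it to the data frame, something like `ProductionWellYear2[[num.col]] <- fill_zero_runs(ProductionWellYear2[[num.col]])` should work.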
jay.sf
  • Your code and the one given are completely different. OP is referring back to the previous row. You are referring to a different column. – Onyambu Feb 19 '20 at 19:37
  • Are you sure? I've tested OP's code and mine, and they yield the same result. – jay.sf Feb 19 '20 at 19:38
  • I do not see the data to test upon. Though OP has `.. + 1`. In both conditions OP is referring to the same column but a different row. In your case you refer to different columns, i.e. `ncol` and `ncol-1`. – Onyambu Feb 19 '20 at 19:42
  • You're right! OP referred to the same column, imputing missing years by adding `+ 1` to the last known year. See edit, should work now. – jay.sf Feb 19 '20 at 19:49
  • Now that works. Well you could even make another variable for `ProductionWellYear2[num.col] == 0` as you are using it twice – Onyambu Feb 19 '20 at 19:52
  • This is a much better answer than mine; shame I focused too much on OP's question about `apply`. That being said, you may want to consider that the first row should not be included in the calculation. Also, this throws an error when the first row == 0. Perhaps something like this? `num.col <- ncol(ProductionWellYear2) # to keep code short num.row <- nrow(ProductionWellYear2) # same idea ProductionWellYear2[2:num.row,][ProductionWellYear2[2:num.row, num.col] == 0, num.col] <- ProductionWellYear2[2:num.row,][which(ProductionWellYear2[2:num.row, num.col] == 0) - 1, num.col] + 1` – Gaffi Feb 20 '20 at 14:16
This should work for you, though without seeing your data I can't verify that the results are what you want. The process is not much different from the original, but benchmarking on my example data shows it to be faster, if not necessarily "fast".

library(microbenchmark)
# Create fake data
set.seed(1)
ProductionWellYear <- data.frame(A = as.integer(rnorm(2500)),
                                 B = as.integer(rnorm(2500)),
                                 C = as.integer(rnorm(2500))
)

# Copy it to confirm results of both processes are the same
ProductionWellYear2 <- ProductionWellYear


# Slightly modified original version
method1 <- function() {
  cols <- ncol(ProductionWellYear)
  for(i in 2:nrow(ProductionWellYear)) {
    if (ProductionWellYear[i, cols] == 0) {
      ProductionWellYear[i, cols] = ProductionWellYear[i - 1, cols] +1
    }
    else {
      ProductionWellYear[i, cols] = ProductionWellYear[i, cols]
    }
  }
}

# New version
method2 <- function() {
  cols <- ncol(ProductionWellYear2)
  sapply(2:nrow(ProductionWellYear2), function(i) {
    if (ProductionWellYear2[i, cols] == 0) {
      ProductionWellYear2[i, cols] <<- ProductionWellYear2[i - 1, cols] +1
    }
  })
}


# Comparing the outputs
all(ProductionWellYear == ProductionWellYear2)
#[1] TRUE

result <- microbenchmark(method1(), method2())
result
#Unit: milliseconds
#      expr      min       lq     mean   median       uq       max neval
#  method1() 151.78802 167.3932 190.14905 176.2855 197.60406 337.9904   100
#  method2()  45.56065  53.7744  67.55549  59.9299  72.81873 174.1417   100
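
Note that in the code above `method1()` modifies only a local copy of `ProductionWellYear`, and the `all()` comparison runs before either function is called. The sketch below is one possible variant, assuming each method should take a data frame and return the modified copy so that the outputs can actually be compared. The names `wells`, `fill_loop` and `fill_sapply` are hypothetical, and no timings are claimed for this version.

```r
# Same fake data as above, under a hypothetical name
set.seed(1)
wells <- data.frame(A = as.integer(rnorm(2500)),
                    B = as.integer(rnorm(2500)),
                    C = as.integer(rnorm(2500)))

# Loop version: works on the function's own copy and returns it
fill_loop <- function(df) {
  cols <- ncol(df)
  for (i in 2:nrow(df)) {
    if (df[i, cols] == 0) {
      df[i, cols] <- df[i - 1, cols] + 1
    }
  }
  df
}

# sapply version: `<<-` updates df in the enclosing function's
# environment (fill_sapply's copy), not a global object
fill_sapply <- function(df) {
  cols <- ncol(df)
  sapply(2:nrow(df), function(i) {
    if (df[i, cols] == 0) {
      df[i, cols] <<- df[i - 1, cols] + 1
    }
  })
  df
}

# Compare the results of actually running both methods on the same input
identical(fill_loop(wells), fill_sapply(wells))
```

Both functions could then be passed to `microbenchmark()` as `fill_loop(wells)` and `fill_sapply(wells)`, so that every repetition starts from the same unmodified input.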
Gaffi