Replacement of loops with if statement in R

Question

Hi, what would be the best way to do the following loops in R?

for (i in 1:nrow(df1)) {
  counter <- 0
  for (j in 1:nrow(df2)) {
    if (df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]{counter = counter + 1}
  }
  df1$counter[i] <- counter
}

Please make this question *reproducible*. This includes sample code (including listing non-base R packages), sample data (e.g., `dput(head(x))`), expected output, and relevant errors or warnings. Refs: https://stackoverflow.com/questions/5963269, https://stackoverflow.com/help/mcve, and https://stackoverflow.com/tags/r/info. (This includes copying code that works ... this is missing a close-paren, I believe.) — r2evans, Oct 16 '18 at 00:09
Thanks, I used the sapply method. However, my data has more than 40k rows and it took more than an hour to run through. Do you have any suggestions to make it faster? — simk, Oct 16 '18 at 17:54
See my benchmark edit, the suggested method is the fourth (`outer`), about 1/3 the processing time (with smaller data). (I haven't tried with larger datasets, though I have no reason to think that methods 3 and 4 would not scale similarly.) — r2evans, Oct 16 '18 at 18:00
Great! I will test the outer method right away. Thanks for your response. — simk, Oct 16 '18 at 18:14
So I tried to use outer, but it gets too big very fast and terminates with the error "cannot allocate vector of size ...", and as I mentioned, sapply is very slow. Any other suggestions? — simk, Oct 17 '18 at 16:55
Two 40K row frames will be difficult to do. It might work faster if they were all numeric and therefore `matrix`es, though that would require a little code-mod. Alternatively, you can use `outer` but break it up to "first 1000, second 1000, etc" of the smaller of the two frames, and then combine the results. Other than that, I don't have anything more. — r2evans, Oct 17 '18 at 17:02

r2evans · Accepted Answer · 2018-10-16T17:59:27.443

There are several ways to attack something like this. I'll demonstrate a few. Since you didn't provide data, look to the bottom for samples.

Fix the code you have (I think you are missing a close-paren):

for (i in 1:nrow(df1)) {
  counter1 <- 0
  for (j in 1:nrow(df2)) {
    if (df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]) { counter1 = counter1 + 1; }
  }
  df1$counter1[i] <- counter1
}
df1
#    a  b counter1
# 1  7 49        3
# 2 18 87        4
# 3 29  3        0
# 4 89 21        0
# 5 58 13        0
# 6 22 66        4
# 7 62 68        0
# 8 97 98        0

(From here on out, I will not show the output; rest assured it is the same. If you don't believe me, try it. I'll keep numbering the counter columns so you can see them side-by-side.)

We can capitalize on R's vectorizing of things. This means that instead of c(1+9, 2+9, 3+9), you can write c(1,2,3)+9 and do it all at once. Similarly, you can actually sum up a vector of boolean (logical) values, which should do what you would expect (sum(T,T,F) is 2). On those themes, let's remove the inner loop:
```
for (i in 1:nrow(df1)) {
  df1$counter2[i] <- sum(df2$x >= df1$a[i] & df2$x < df1$b[i])
}
```
This is still a little un-R-onic (adaptation of pythonic). Let's try one of the apply variants meant to operate on a simple vector and return a vector, which we'll capture as a counter:
```
df1$counter3 <- sapply(seq_len(nrow(df1)),
                       function(i) sum(df2$x >= df1$a[i] & df2$x < df1$b[i]))
```

Another technique is a less-frequent one, but can be useful at times (depending on how/where you apply it). The outer function effectively gives you all combinations of two vectors (similar to but distinct from expand.grid).

outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
      function(i, j) df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i])
#       [,1]  [,2]  [,3]  [,4]  [,5]
# [1,] FALSE  TRUE  TRUE  TRUE FALSE
# [2,]  TRUE  TRUE FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE
# [6,]  TRUE  TRUE FALSE  TRUE  TRUE
# [7,] FALSE FALSE FALSE FALSE FALSE
# [8,] FALSE FALSE FALSE FALSE FALSE

The function is actually called only once. If you were to peek at its arguments when it is called, you would see this:

i
#  [1] 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4
# [37] 5 6 7 8
j
#  [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5
# [37] 5 5 5 5

From here, that inner function unrolls to something like:

# df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i] # i,j
  df2$x[1] >= df1$a[1] & df2$x[1] < df1$b[1] # 1,1
  df2$x[1] >= df1$a[2] & df2$x[1] < df1$b[2] # 2,1
  df2$x[1] >= df1$a[3] & df2$x[1] < df1$b[3] # 3,1
  # ...
  df2$x[1] >= df1$a[8] & df2$x[1] < df1$b[8] # 8,1
  df2$x[2] >= df1$a[1] & df2$x[2] < df1$b[1] # 1,2
  df2$x[2] >= df1$a[2] & df2$x[2] < df1$b[2] # 2,2
  # ...
  df2$x[5] >= df1$a[7] & df2$x[5] < df1$b[7] # 7,5
  df2$x[5] >= df1$a[8] & df2$x[5] < df1$b[8] # 8,5

and then gets shaped like a matrix with the appropriate number of rows and columns depending on the lengths of the input vectors. (There are lots of matrix-esque things you can do with this outer-product function; this is warping it from mathematical to lookup/calculate.)
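
If you want to convince yourself of the single call, here is a small sketch on toy inputs. The `calls` counter and the `i * 10 + j` function are made up purely for illustration; they are not part of the solution.

```r
# Count how many times outer() invokes the function (illustrative only).
calls <- 0
m <- outer(1:3, 1:2, function(i, j) {
  calls <<- calls + 1   # incremented once per invocation
  i * 10 + j            # i and j arrive as full-length vectors (length 6 here)
})
calls
# [1] 1
dim(m)
# [1] 3 2
```

This is also why the function you hand to `outer` has to be vectorized: it must return one value per element of `i` and `j`.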

Now that you have a matrix of logicals, it's easy enough to determine the sums of rows with rowSums:

rowSums(outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
              function(i, j) df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]))
# [1] 3 4 0 0 0 4 0 0

(which could have been assigned with df1$counter4 <- rowSums(...))
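
With large frames the full logical matrix may not fit in memory (see the comments about 40K rows). A rough sketch of the "break it up" idea from the comments follows, processing `df1` a block of rows at a time. The names `chunk_size`, `starts`, and `counter5` are mine, the tiny chunk size is only so the sample data exercises more than one block, and I have not timed this on large data. It assumes `df1` has at least one row.

```r
# Sample data copied from the Data section so the block runs on its own.
df1 <- data.frame(a = c(7, 18, 29, 89, 58, 22, 62, 97),
                  b = c(49, 87, 3, 21, 13, 66, 68, 98))
df2 <- data.frame(x = c(51, 31, 17, 41, 49))

chunk_size <- 3   # hypothetical; use something like 1000 for big frames
starts <- seq(1, nrow(df1), by = chunk_size)   # assumes nrow(df1) >= 1

counter5 <- unlist(lapply(starts, function(s) {
  # row indices of df1 handled in this block
  idx <- s:min(s + chunk_size - 1, nrow(df1))
  # only a length(idx)-by-nrow(df2) matrix exists at any one time
  rowSums(outer(idx, seq_len(nrow(df2)),
                function(i, j) df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]))
}))
counter5
# [1] 3 4 0 0 0 4 0 0
```

The total number of comparisons is unchanged, so this trades the allocation error for bounded memory rather than making the work itself smaller.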

Data:

set.seed(20181015)
n1 <- 5
n2 <- 8
df1 <- data.frame(a = sample(100, size=n2), b = sample(100, size=n2))
df1
#    a  b
# 1  7 49
# 2 18 87
# 3 29  3
# 4 89 21
# 5 58 13
# 6 22 66
# 7 62 68
# 8 97 98
df2 <- data.frame(x = sample(100, size=n1))
df2
#    x
# 1 51
# 2 31
# 3 17
# 4 41
# 5 49

Benchmarking, for the curious:

library(microbenchmark)
microbenchmark(
  c1 = {
    for (i in 1:nrow(df1)) {
      counter1 <- 0
      for (j in 1:nrow(df2)) {
        if (df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]) { counter1 = counter1 + 1; }
      }
      df1$counter1[i] <- counter1
    }
  },
  c2 = {
    for (i in 1:nrow(df1)) {
      df1$counter2[i] <- sum(df2$x >= df1$a[i] & df2$x < df1$b[i])
    }
  },
  c3 = {
    sapply(seq_len(nrow(df1)),
           function(i) sum(df2$x >= df1$a[i] & df2$x < df1$b[i]))
  },
  c4 = {
    rowSums(outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
                  function(i, j) df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]))
  },
  times=100
)
# Unit: microseconds
#  expr    min      lq     mean median      uq     max neval
#    c1 7022.1 7669.45 9608.953 8301.4 8989.25 19038.8   100
#    c2 4168.5 4634.00 5698.094 4998.5 5405.45 15927.4   100
#    c3  153.7  182.60  237.050  194.1  216.40  3209.6   100
#    c4   35.2   48.30   62.348   61.5   70.95   141.0   100
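
All four methods above still compare every row of `df1` against every row of `df2`. If that is too slow for large frames, one different approach (not one of the benchmarked methods, and not timed here) is to sort `df2$x` once and count with base R's `findInterval`. The count for a row is (number of `x` below `b`) minus (number of `x` below `a`), floored at zero for rows where `b` is not above `a`. The names `xs`, `n_below_a`, `n_below_b`, and `counter6` are mine, and the sketch assumes there are no `NA` values.

```r
# Sample data copied from the Data section so the block runs on its own.
df1 <- data.frame(a = c(7, 18, 29, 89, 58, 22, 62, 97),
                  b = c(49, 87, 3, 21, 13, 66, 68, 98))
df2 <- data.frame(x = c(51, 31, 17, 41, 49))

xs <- sort(df2$x)
# With left.open = TRUE, findInterval(v, xs) is the number of xs strictly below v.
n_below_b <- findInterval(df1$b, xs, left.open = TRUE)
n_below_a <- findInterval(df1$a, xs, left.open = TRUE)
# x >= a & x < b  is  (x < b) minus (x < a); negative when b < a, so floor at 0.
counter6 <- pmax(n_below_b - n_below_a, 0L)
counter6
# [1] 3 4 0 0 0 4 0 0
```

No intermediate matrix is built, so memory stays proportional to the lengths of the two frames.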

BTW, in case anyone is curious: my rationale for `seq_len(nrow(df))` over `1:nrow(df)` is defensive, and not specific to frames. Consider if `i=2`: `1:i` and `seq_len(i)` are identical; however, if `i=0`, then `seq_len(i)` returns an empty vector (so iterators never do anything, good), but `1:i` returns a vector of length 2 (`c(1, 0)`), which will likely break things. — r2evans, Oct 16 '18 at 18:19
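
A quick way to see the difference described in that comment, using a zero count in place of an empty frame (the `hits_*` counters are only for illustration):

```r
n <- 0        # e.g. nrow() of an empty data frame
1:n
# [1] 1 0
seq_len(n)
# integer(0)

hits_colon <- 0
for (i in 1:n) hits_colon <- hits_colon + 1       # body runs for i = 1 and i = 0
hits_seq <- 0
for (i in seq_len(n)) hits_seq <- hits_seq + 1    # body never runs
c(hits_colon, hits_seq)
# [1] 2 0
```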
