Hi, what would be the best way to do the following loops in R?
for (i in 1:nrow(df1)) {
counter <- 0
for (j in 1:nrow(df2)) {
if (df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]{counter = counter + 1}
}
df1$counter[i] <- counter
}
There are several ways to attack something like this. I'll demonstrate a few. Since you didn't provide data, look to the bottom for samples.
Fix the code you have (I think you are missing a close-paren):
for (i in 1:nrow(df1)) {
  counter1 <- 0
  for (j in 1:nrow(df2)) {
    if (df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]) { counter1 = counter1 + 1; }
  }
  df1$counter1[i] <- counter1
}
df1
#    a  b counter1
# 1  7 49        3
# 2 18 87        4
# 3 29  3        0
# 4 89 21        0
# 5 58 13        0
# 6 22 66        4
# 7 62 68        0
# 8 97 98        0
(From here on I will not show the output; rest assured it is the same, and if you don't believe me, try it. I'll keep numbering the counter columns so you can see them side by side.)
We can capitalize on R's vectorization. This means that instead of c(1+9, 2+9, 3+9), you can write c(1,2,3)+9 and do it all at once. Similarly, you can sum a vector of boolean (logical) values, which does what you would expect (sum(T,T,F) is 2). On those themes, let's remove the inner loop:
for (i in 1:nrow(df1)) {
  df1$counter2[i] <- sum(df2$x >= df1$a[i] & df2$x < df1$b[i])
}
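To see those two ideas in isolation, here is a minimal sketch. The values of x are the df2$x sample values from the bottom of this answer, and lo/hi are the first row of df1; the variable names themselves are just illustrative.

```r
# Vectorized arithmetic: the scalar 9 is recycled across the vector.
shifted <- c(1, 2, 3) + 9

x  <- c(51, 31, 17, 41, 49)   # same values as df2$x in the sample data
lo <- 7                       # df1$a[1]
hi <- 49                      # df1$b[1]

# Element-wise comparison: one logical per element of x.
in_range <- x >= lo & x < hi

# Summing logicals counts the TRUEs (TRUE is 1, FALSE is 0).
n_in_range <- sum(in_range)
print(in_range)
print(n_in_range)
```

The count matches the first row of the counter1 column above, which is all the inner loop was computing.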
This is still a little un-R-onic (an adaptation of "pythonic"). Let's try one of the apply variants meant to operate on a simple vector and return a vector, which we'll capture as a counter:
df1$counter3 <- sapply(seq_len(nrow(df1)),
                       function(i) sum(df2$x >= df1$a[i] & df2$x < df1$b[i]))
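If you prefer a variant that checks the type of each result, vapply is a drop-in alternative to sapply. The sketch below is my own addition rather than part of the timings further down. It re-creates the sample data by hand so it runs on its own, and the column name counter3b is made up for illustration.

```r
# Sample data, hard-coded from the printed values at the bottom of this answer.
df1 <- data.frame(a = c(7, 18, 29, 89, 58, 22, 62, 97),
                  b = c(49, 87, 3, 21, 13, 66, 68, 98))
df2 <- data.frame(x = c(51, 31, 17, 41, 49))

# vapply() errors out if any call returns something other than a single
# integer, whereas sapply() would silently change its return type.
df1$counter3b <- vapply(seq_len(nrow(df1)),
                        function(i) sum(df2$x >= df1$a[i] & df2$x < df1$b[i]),
                        FUN.VALUE = integer(1))
print(df1$counter3b)
```

sum() over a logical vector returns an integer, which is why integer(1) is the right template here.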
Another technique is used less often but can be useful at times (depending on how and where you apply it). The outer function effectively evaluates all combinations of two vectors (similar to but distinct from expand.grid).
outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
      function(i, j) df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i])
#       [,1]  [,2]  [,3]  [,4]  [,5]
# [1,] FALSE  TRUE  TRUE  TRUE FALSE
# [2,]  TRUE  TRUE FALSE  TRUE  TRUE
# [3,] FALSE FALSE FALSE FALSE FALSE
# [4,] FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE FALSE
# [6,]  TRUE  TRUE FALSE  TRUE  TRUE
# [7,] FALSE FALSE FALSE FALSE FALSE
# [8,] FALSE FALSE FALSE FALSE FALSE
outer actually calls the function only once; if you were to peek at its arguments during that call, you would see this:
i
# [1] 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4
# [37] 5 6 7 8
j
# [1] 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 4 4 4 4 4 4 4 4 5 5 5 5
# [37] 5 5 5 5
From here, that inner function unrolls to something like:
# df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i] # i,j
df2$x[1] >= df1$a[1] & df2$x[1] < df1$b[1] # 1,1
df2$x[1] >= df1$a[2] & df2$x[1] < df1$b[2] # 2,1
df2$x[1] >= df1$a[3] & df2$x[1] < df1$b[3] # 3,1
# ...
df2$x[1] >= df1$a[8] & df2$x[1] < df1$b[8] # 8,1
df2$x[2] >= df1$a[1] & df2$x[2] < df1$b[1] # 1,2
df2$x[2] >= df1$a[2] & df2$x[2] < df1$b[2] # 2,2
# ...
df2$x[5] >= df1$a[7] & df2$x[5] < df1$b[7] # 7,5
df2$x[5] >= df1$a[8] & df2$x[5] < df1$b[8] # 8,5
and the result then gets shaped into a matrix with the appropriate number of rows and columns, depending on the lengths of the input vectors. (There are lots of matrix-esque things you can do with this outer-product function; this is warping it from mathematical use to lookup/calculate.)
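If you want to do the peeking yourself, a small sketch along these lines should work. The helper peek and the counter environment state are names I made up for the demonstration, and the inputs are kept tiny so the printout stays readable.

```r
state <- new.env()
state$calls <- 0L

peek <- function(i, j) {
  state$calls <- state$calls + 1L   # count how often outer() invokes us
  cat("i:", i, "\n")                # full expanded index vector for rows
  cat("j:", j, "\n")                # full expanded index vector for columns
  i * 10 + j                        # must return one value per (i, j) pair
}

m <- outer(1:3, 1:2, peek)
print(state$calls)
print(dim(m))
```

The practical consequence is that whatever function you hand to outer has to be vectorized: it receives two equal-length vectors and must return a vector of that same length.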
Now that you have a matrix of logicals, it's easy enough to determine the sums of rows with rowSums:
rowSums(outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
              function(i, j) df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]))
# [1] 3 4 0 0 0 4 0 0
(which could have been assigned with df1$counter4 <- rowSums(...))
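For anyone who does want to check that the four approaches agree, here is a self-contained sketch. It hard-codes the printed sample data instead of relying on the seed, since sample() can return different values across R versions; the name all_equal is just for illustration.

```r
df1 <- data.frame(a = c(7, 18, 29, 89, 58, 22, 62, 97),
                  b = c(49, 87, 3, 21, 13, 66, 68, 98))
df2 <- data.frame(x = c(51, 31, 17, 41, 49))

# 1: nested loops
for (i in seq_len(nrow(df1))) {
  counter1 <- 0
  for (j in seq_len(nrow(df2))) {
    if (df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]) counter1 <- counter1 + 1
  }
  df1$counter1[i] <- counter1
}

# 2: single loop with a vectorized inner comparison
for (i in seq_len(nrow(df1))) {
  df1$counter2[i] <- sum(df2$x >= df1$a[i] & df2$x < df1$b[i])
}

# 3: sapply over row indices
df1$counter3 <- sapply(seq_len(nrow(df1)),
                       function(i) sum(df2$x >= df1$a[i] & df2$x < df1$b[i]))

# 4: outer() plus rowSums()
df1$counter4 <- rowSums(outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
                              function(i, j) df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]))

# Compare with == rather than identical(): the columns differ in storage
# type (double vs. integer) even though the counts are the same.
all_equal <- all(df1$counter1 == df1$counter2,
                 df1$counter2 == df1$counter3,
                 df1$counter3 == df1$counter4)
print(all_equal)
```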
Data:
set.seed(20181015)
n1 <- 5   # number of rows in df2 (the values to count)
n2 <- 8   # number of rows in df1 (the [a, b) ranges)
df1 <- data.frame(a = sample(100, size=n2), b = sample(100, size=n2))
df1
#    a  b
# 1  7 49
# 2 18 87
# 3 29  3
# 4 89 21
# 5 58 13
# 6 22 66
# 7 62 68
# 8 97 98
df2 <- data.frame(x = sample(100, size=n1))
df2
#    x
# 1 51
# 2 31
# 3 17
# 4 41
# 5 49
Benchmarking, for the curious:
library(microbenchmark)
microbenchmark(
  c1 = {
    for (i in 1:nrow(df1)) {
      counter1 <- 0
      for (j in 1:nrow(df2)) {
        if (df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]) { counter1 = counter1 + 1; }
      }
      df1$counter1[i] <- counter1
    }
  },
  c2 = {
    for (i in 1:nrow(df1)) {
      df1$counter2[i] <- sum(df2$x >= df1$a[i] & df2$x < df1$b[i])
    }
  },
  c3 = {
    sapply(seq_len(nrow(df1)),
           function(i) sum(df2$x >= df1$a[i] & df2$x < df1$b[i]))
  },
  c4 = {
    rowSums(outer(seq_len(nrow(df1)), seq_len(nrow(df2)),
                  function(i, j) df2$x[j] >= df1$a[i] & df2$x[j] < df1$b[i]))
  },
  times=100
)
# Unit: microseconds
#  expr    min      lq     mean median      uq     max neval
#    c1 7022.1 7669.45 9608.953 8301.4 8989.25 19038.8   100
#    c2 4168.5 4634.00 5698.094 4998.5 5405.45 15927.4   100
#    c3  153.7  182.60  237.050  194.1  216.40  3209.6   100
#    c4   35.2   48.30   62.348   61.5   70.95   141.0   100
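One caveat on scaling: the outer approach builds a full nrow(df1) by nrow(df2) logical matrix, which may get uncomfortable for large inputs. A sort-based count with findInterval avoids that. This is a different technique from the four above and I have not benchmarked it here, so treat it as a sketch; the names xs, below_a, below_b and counter5 are my own.

```r
df1 <- data.frame(a = c(7, 18, 29, 89, 58, 22, 62, 97),
                  b = c(49, 87, 3, 21, 13, 66, 68, 98))
df2 <- data.frame(x = c(51, 31, 17, 41, 49))

xs <- sort(df2$x)

# With left.open = TRUE, findInterval(v, xs) is the number of xs strictly
# less than v.
below_b <- findInterval(df1$b, xs, left.open = TRUE)
below_a <- findInterval(df1$a, xs, left.open = TRUE)

# (# of x < b) - (# of x < a) is the number of x with a <= x < b.
# When b <= a the range is empty and the difference can go negative,
# so clamp it at zero.
counter5 <- pmax(below_b - below_a, 0L)
print(counter5)
```

On the sample data this gives the same counts as the other approaches.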