1

R is written for vector/matrix operations. It allows but is not happy with for() loops. Nested for() loops take forever

I've read that pretty much all for() loops can be turned into proper vector operations, but for the life of me, I can't figure out how to do it in this simple case:

I have two data tables, dt_a and dt_b of different lengths (dt_a: 1408 rows & dt_b: 2689 rows), with columns dt_a$x, dt_b$y, and dt_b$z. I want to search for matches of any value in of column dt_a$x in each value of dt_b$y and if they match, set dt_b$z <- dt_a$x. If there's no match, set it to "NOMATCH".

This is a programming 101 operation with for loops:

for (i in 1:2689) {
    for (j in 1:1408) {
        if (grepl(dt_a$x[j], dt_b$y[i], ignore.case=TRUE, perl=TRUE)) {
            dt_b$z[i] <- dt_a$x[j];
            break;
        }
        dt_a$z[i] <- "NOMATCH";
    }
}   

However, this operation takes more than 6 minutes to run, iterating through all the loops. I'll soon need to adapt it to a much larger data set, so the order of magnitude time increases will not be viable.

What's the correct way to do this nested for() loop operation using proper R vector operations?

Thanks!

Update

The answer by @nickk vectorizes one of the loops making the nesting unecessary and reducing the execution by an order of magnitude. I've credited it as most useful answer because I was able to get it to work in my code. The answers provided by @deanmacgregor were very useful in helping me understand more about what is going on. I couldn't get them to run in my code, but that's probably my fault for not understanding something. The cross-join approach, in particular, is probably the best solution. I need more practice in order to make it work with my data, but I don't want to wait too long before resolving this question.

Additional thanks to @romantsegelskyi for teaching me proper question formatting, and to @pierrelafortune and @brodieG for teaching me the importance and content of reproducible questions. ^_^

I've credited you all in my source code which will (someday) be released as open source.

Mekki MacAulay
  • 1,727
  • 2
  • 12
  • 23
  • Are you searching for patterns, or literal values? Is there only one possible match? Please try to make your **[problem reproducible](http://stackoverflow.com/a/28481250/2725969)**. – BrodieG Jul 10 '15 at 18:11
  • In the basic case, I'm just matching on one literal value "contained" (grep) anywhere in the other literal value. However, in the general case, I would like to search based on patterns, but that doesn't seem specifically related to the nested for() loops question, so I didn't want to overburden this question. If it's related, then I must be lacking some understanding. – Mekki MacAulay Jul 10 '15 at 18:19

4 Answers4

2

Just saw from comments that exact matches don't work. Here's a new method using crossjoin

library(data.table)
#make dummy data
dt_a<-data.table(x=unlist(lapply(1:1408, function(x) paste0(LETTERS[as.integer(runif(3,1,26))],collapse=""))))
dt_b<-data.table(y=unlist(lapply(1:2689, function(x) paste0(letters[as.integer(runif(4,1,26))],collapse=""))))
#remove dupes from dummy data
dt_a<-unique(dt_a)
dt_b<-unique(dt_b)


#make crossjoin
cross<-CJ(x=dt_a[,x],y=dt_b[,y])
#make column that is true for match/false for non-match
cross[,Match:=grepl(x,y,ignore.case = T),by=x]
#make z column corresponding to match
cross[,z:="NOMATCH"]
cross[Match==TRUE,z:=x]
#get rid of Match and x column
cross[,Match:=NULL]
cross[,x:=NULL]
#helper function to deal with all the extra rows
fixZ<-function(x) {
  if(any(x!="NOMATCH")) {
    return(x[!x%in%"NOMATCH"])
  } else {
    return("NOMATCH")
  }
}
#run helper function on column z for every y value
dt_b<-unique(cross[,list(z=fixZ(z)),by="y"])

This is old:

Use the %in% operator.

library(data.table)
#make dummy data
dt_a<-data.table(x=unlist(lapply(1:1408, function(x) paste0(LETTERS[as.integer(runif(3,1,26))],collapse=""))))
dt_b<-data.table(y=unlist(lapply(1:2689, function(x) paste0(letters[as.integer(runif(3,1,26))],collapse=""))))
#remove dupes from dummy data
dt_a<-unique(dt_a)
dt_b<-unique(dt_b)
#make dummy upper case versions of x and y for case insensitive comparison
dt_b[,upper:=toupper(y)]
dt_a[,upper:=toupper(x)]
#make default z
dt_b[,z:="NOMATCH"]    
#set z to y when y exists in x
dt_b[upper %in% dt_a[,upper],z:=y]   
#replace z with x so case of z matches case of x
setkey(dt_a,upper)
setkey(dt_b,upper)
dt_b[dt_a,z:=ifelse(!is.na(z),x,NA)]


#delete dummy variables
dt_b[,upper:=NULL]
dt_a[,upper:=NULL]
Dean MacGregor
  • 11,847
  • 9
  • 34
  • 72
  • This may be what I need. In my mind this is "vectorizing". I was probably using the wrong term. I will test and update. Thanks! – Mekki MacAulay Jul 10 '15 at 21:20
  • 1
    `%in%` _is_ a vector operation – MichaelChirico Jul 10 '15 at 21:32
  • @MichaelChirico touche. – Dean MacGregor Jul 10 '15 at 21:33
  • You could simplify the last 4 lines before delete dummy to `setkey(dt_a, upper); setkey(dt_b, upper); dt_b[dt_a, z:=x]` – Nick Kennedy Jul 10 '15 at 23:04
  • I've tried both of these approaches and I can't seem to get it to work. I'm probably making a mistake. I think the `cross` is possibly the approach I'm thinking of. The `%in%` approach is what makes the most logical sense to me, only the lines `#set z to y when y exists in x dt_b[upper %in% dt_a[,upper],z:=y]` seem to have it backwards in my head. I want to look for when x exists in y. It would work both ways with numbers, but with text blocks it works differently. I understand why data is so important now in examples. :) I've learned a lot along the way anyways. Thanks! – Mekki MacAulay Jul 11 '15 at 17:35
1

Here's an example to think about vectorization:

dt_a <- c(1,2,3)
dt_b <- c(3,2,1,0)
dt_a == dt_b
# [1] FALSE  TRUE FALSE FALSE
# Warning message:
# In dt_a == dt_b :
#   longer object length is not a multiple of shorter object length

They are of unequal length. The evaluator will complete the action, but it will warn us that the smaller vector is being recycled. If we are sure that we only want to compare the values the length of dt_a we can subset dt_b up to that length for an equal length match.

dt_a == dt_b[seq_along(dt_a)]
#[1] FALSE  TRUE FALSE

From there it is easy to vectorize:

dt_z <- ifelse(dt_a == dt_b[seq_along(dt_a)], dt_a, "NOMATCH")
dt_z
#[1] "NOMATCH" "2"       "NOMATCH"

Update

Let us emphasize the importance of a reproducible example. It gives everyone on the site a chance to try out different approaches. Here is another example with your loop recoded. Is this what your loop currently does?

a <- c(5,0,9)
b <- c(2,5,0,1,9)
c <- c()
d <- c()
for (i in 1:5) {
    for (j in 1:3) {
        if (grepl(a[j], b[i], ignore.case=TRUE, perl=TRUE)) {
            c[i] <- a[j];
            break;
        }
        d[i] <- "NOMATCH";
    }
}

c
[1] NA  5  0 NA  9

d
[1] "NOMATCH" NA        "NOMATCH" "NOMATCH" "NOMATCH"
Pierre L
  • 28,203
  • 6
  • 47
  • 69
  • If I'm understanding your example correctly, `seq_along()` might be exactly what I need. I was using the nested `for()` loop to generate a sequence. I'll test and get back to you for credit. Thanks! – Mekki MacAulay Jul 10 '15 at 18:33
  • Yes because it will count all of the second column up to the length of the first column and ignore the rest. – Pierre L Jul 10 '15 at 18:36
  • Hmm, so that doesn't work. I want to iterate through ALL of `dt_a` for EACH of `dt_b`, position independent. Looking at your example, the vectorized output would be `#[1] "1" "2" "3"`. – Mekki MacAulay Jul 10 '15 at 18:48
  • Okay do understand why we say you need a reproducible example? I came up with an example in 30 seconds. You can take the time to make up a few vectors and try out different approaches. There is no time to guess what you are imagining. – Pierre L Jul 10 '15 at 19:02
  • It seems by "reproducible example", you mean sample data. The code I provided is data independent. I'm new to Stack Overflow. I'll do better next time and create a sample data set as well. – Mekki MacAulay Jul 10 '15 at 19:16
  • If you are suggesting that I'm making up my own definition of "reproducible", you can locate many sources that will concur on the definition. "Sample data" is one aspect of reproducibility, but the strength of the method is preicision in results and process. – Pierre L Jul 10 '15 at 19:22
  • I'll post an update with your example format and data shortly. Thanks. – Mekki MacAulay Jul 10 '15 at 19:23
  • I was suggesting that you have a knowledge of specific terms and their detailed connotations that I didn't yet have. I have now read @BrodieG 's [reproducible problem](http://stackoverflow.com/a/28481250/2725969) link and I now understand what you mean. Please just be patient. I'm not deliberately trying to waste your time. Incidentally, this is [good reading](https://blog.stackexchange.com/2012/07/kicking-off-the-summer-of-love/). – Mekki MacAulay Jul 10 '15 at 19:32
  • It is the summer of love. Welcome to SO. Did you see the updated answer? – Pierre L Jul 10 '15 at 19:40
  • Thanks for the welcome and your updated answer. I will test and get back to you. – Mekki MacAulay Jul 10 '15 at 21:21
  • Yes, your updated answer is what my loop currently does, only I've learned from other answers why numerical data matching is in fact different. The importance of sample data is now very clear to me! Thanks! I've learned a lot from my first question. :) – Mekki MacAulay Jul 11 '15 at 17:36
0

Vector operations (apply/lapply/sapply/mapply) in R are not directly equivalent to for/while loops. apply does exactly as it says: it applies a function on a number of similar arguments sequentially, and returns the result. Hence, by definition you cannot break out of an apply. This was discussed on R forums some time ago.

Moreover, you can only access the global environment and even change variables using assign or <<-, but this is pretty dangerous.

So it requires a bit of rethinking of what you want to achieve before you will be able to convert it to vectorized operations.

> x <- 7:11
> y <- 1:10
> z <- rep("No match", 5)
> ind <- which(apply(sapply(x, grepl, y), 2, any) == T)
> ind
[1] 1 2 3 4
> m.val <- which(apply(sapply(x, grepl, y), 1, any) == TRUE)
> m.val
[1]  7  8  9 10
> z[ind] <- y[m.val]
> z
[1] "7"        "8"        "9"        "10"       "No match"

That doesn't seem like much simplification though

romants
  • 3,660
  • 1
  • 21
  • 33
  • This seems to work nicely for corresponding elements (`apply()`), but I'm looking for each element in one vector in all the elements of the other vector, not just in the same positions. Can `apply()` and/or `sapply()` be used in that manner? Also, thanks for the formatting tips! :) I'll get the hang of this. – Mekki MacAulay Jul 10 '15 at 18:59
  • The `apply` family of functions are just wrappers for `for` loops and will perform equivalently. The exception is that (I forget which) one of them converts dataframes into matrices first so it gives the perception of increasing performance but really it is because R works on matrices faster than dataframes. – Dean MacGregor Jul 10 '15 at 19:39
  • @DeanMacGregor one advantage is that they preallocate the output vector so they avoid the tendency to do `x <- numeric(0); for (i in 1:10) x <- c(x, i)` – Nick Kennedy Jul 10 '15 at 21:40
  • @NickK That's a good point since adding rows to matrices (or dataframes) or values to vectors is expensive. That being said, if you – Dean MacGregor Jul 10 '15 at 21:42
0
dt_b[, z := NA]
for (x in dt_a$x) {
  found <- grepl(x, dt_b$y, ignore.case=TRUE, perl=TRUE)
  dt_b[found & is.na(z), z := x]
}
dt_b[is.na(z), z := "NOMATCH"]

This is closer to the functionality of the original than the other answers so far. dt_a$x can have any valid PCRE pattern rather than looking for exact matches. Using @DeanMacGregor's data, it takes a few seconds to run on my machine.

Note that it takes advantage of the fact that grepl is vectorised. Working through dt_a$x and only replacing NA values replicates the effect of the break seen before.

For slightly faster results, this will do in place of the grepl line.

  found <- stringi::stri_detect_regex(dt_b$y, x, opts_regex = stri_opts_regex(case_insensitive = TRUE))
Nick Kennedy
  • 12,510
  • 2
  • 30
  • 52
  • Your example works as I intended using the`grepl` vectorization to avoid the nested `for()`. Awesome! It's an order of magnitude faster this way. And, I learned the difference between `data.table` and `data.frame` along the way (I was actually using `data.frames`, not `data.tables`!) Thanks! :) – Mekki MacAulay Jul 11 '15 at 16:42