Avoid the use of for loops

Question

I'm working with R and I have code like this:

for (i in 1:10)
   for (j in 1:100)
        if (data[i] == paths[j,1])
            cluster[i,4] <- paths[j,2]

where:

data is a vector with 100 rows and 1 column
paths is a matrix with 100 rows and 5 columns
cluster is a matrix with 100 rows and 5 columns

My question is: how could I avoid the use of "for" loops to iterate through the matrix? I don't know whether apply functions (lapply, tapply...) are useful in this case.

This becomes a problem when j runs up to 10000, for example, because the execution time is very long.

Thank you

Do you really intend to save in 'cluster' the last matching 'paths'? – teucer, Jun 02 '10 at 11:59
Yes, what Musa and wkmor1 said... Did you really mean for i to go to 10... only testing the first 10 items in the 100-item vector data? --- The general answer to your question is that you have to start thinking in vectors rather than individual items. There are vastly faster ways to do something like what you're doing as soon as it makes sense. – John, Jun 02 '10 at 14:02
Thank you, guys. What I want to do is save values from the 2nd column of paths into column 4 of cluster when values from "data" are equal to values from "paths", and avoid the "for" statement, because computation time increases sharply when I have a lot of observations. – albergali, Jun 02 '10 at 19:28
And what do you want to do when they are not equal? Are they just ignored? Or is cluster column 4 already set to something that doesn't change unless this condition is met? (Sounds like a simple ifelse() command, no loops; check the help.) – John, Jun 02 '10 at 21:52
@albergali Shouldn't the last line of your code be `cluster[j,4] <- paths[j,2]`? – Marek, Jun 03 '10 at 22:56

Marek · Answer 1 · 2010-06-03T23:09:03.377

The inner loop could be vectorized:

cluster[i,4] <- paths[max(which(data[i]==paths[,1])),2]

but check Musa's comment. I think you intended something else.
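A minimal sketch of what that replacement looks like on toy data, checked against the original double loop. The small `data` and `paths` values are made up for illustration, and the `length(hits) > 0` guard is an addition of this sketch: without it, `max(which(...))` has nothing to work with when a value has no match.

```r
# Toy inputs, invented for illustration (not the asker's data)
data  <- c(3, 1, 4, 1)
paths <- cbind(c(1, 3, 1, 9), c(10, 20, 30, 40))

# Reference result: the original double loop
cluster_loop <- matrix(NA_real_, nrow = length(data), ncol = 5)
for (i in seq_along(data))
  for (j in seq_len(nrow(paths)))
    if (data[i] == paths[j, 1])
      cluster_loop[i, 4] <- paths[j, 2]

# Same thing with the inner loop replaced by which()
cluster_vec <- matrix(NA_real_, nrow = length(data), ncol = 5)
for (i in seq_along(data)) {
  hits <- which(data[i] == paths[, 1])   # all rows of paths that match
  if (length(hits) > 0)                  # guard: skip values with no match
    cluster_vec[i, 4] <- paths[max(hits), 2]  # the last match wins, as in the loop
}

identical(cluster_loop, cluster_vec)  # TRUE
```

Taking `max(hits)` reproduces the loop's behaviour of keeping the last matching row of `paths`.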

The second (outer) loop could be vectorized too, by replicating vectors, but:

if i only goes up to 100, the speed-up won't be large
it will need more RAM

[edit] As I understand your comment, can you just use logical indexing?

indx <- data==paths[, 1]
cluster[indx, 4] <- paths[indx, 2]
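To make the behaviour concrete, here is a small sketch with invented values. It assumes `data` has as many elements as `paths` has rows, because `==` compares them position by position (row k of `data` against row k of `paths`), which is not the same as searching all of `paths` for each value.

```r
# Toy inputs, invented for illustration; data and paths have the same length
data    <- c(5, 7, 9, 2)
paths   <- cbind(c(5, 1, 9, 3), c(50, 10, 90, 30))
cluster <- matrix(0, nrow = 4, ncol = 5)

indx <- data == paths[, 1]           # TRUE FALSE TRUE FALSE
cluster[indx, 4] <- paths[indx, 2]   # only the matching rows are overwritten

cluster[, 4]                         # 50 0 90 0
```

Rows where the comparison is FALSE keep whatever `cluster[, 4]` already held.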

gd047 · Answer 2 · 2010-06-04T05:34:06.537

I think that both loops can be vectorized using the following:

cluster[na.omit(match(paths[1:100,1],data[1:10])),4] = paths[!is.na(match(paths[1:100,1],data[1:10])),2]
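For comparison, a hedged sketch of a lookup in the other direction: each value of `data` is searched for in `paths[, 1]`. This is not the code above, only one possible variant. The toy values are invented, and the use of `rev()` to get the last match instead of the first is an assumption about the behaviour that is wanted, namely the same "last match wins" result as the double loop.

```r
# Toy inputs, invented for illustration; note the repeated 1 in both
data    <- c(3, 1, 4, 1)
paths   <- cbind(c(1, 3, 1, 9), c(10, 20, 30, 40))
cluster <- matrix(NA_real_, nrow = length(data), ncol = 5)

keys <- paths[, 1]

# match() reports the first match, so search the reversed keys
# and convert the position back to an index into the original order
last_hit <- length(keys) + 1 - match(data, rev(keys))  # NA where nothing matches

found <- !is.na(last_hit)
cluster[found, 4] <- paths[last_hit[found], 2]

cluster[, 4]   # 20 30 NA 30
```

Because the lookup starts from `data`, repeated values in `data` each get their own result, and values with no match are left untouched.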

edited Jun 04 '10 at 05:34

answered Jun 03 '10 at 08:20


I wonder how the performance of your vectorized solution compares to the looping alternative. – Guido Jun 04 '10 at 06:59
@Guido In this particular case it's hard to say, because the results from the original loop and gd047's solution differ, but in general the difference between a loop and vectorized code can be huge. Check my answer to http://stackoverflow.com/questions/2908822/speed-up-the-loop-operation-in-r, where you can go from hours to less than a second. – Marek Jun 04 '10 at 19:40
@Marek Using randomized test matrices I got equal cluster matrices using both methods. I checked the results using `all.equal(loop_sol,vect_sol)`. Which test matrices did you use that gave you different results? – gd047 Jun 04 '10 at 20:25
@gd047 Check this http://sites.google.com/site/fsh9rss8heh/ (too long for comment), I use R-2.10.1 – Marek Jun 04 '10 at 22:15
@Marek Thanks. You are right. In my examples there was never more than one match between data[i] and paths[j,1]. In the general case, where there is more than one, the match that is checked last wins in the loop. I am not sure which one wins in the vectorized version. Do you have any idea? – gd047 Jun 05 '10 at 06:17
@gd047 As stated in `help("match")`, `match` returns **positions of (first) matches**, so you could write your own version using rev: `match_last <- function(x,y) length(y)-match(x,rev(y))+1` – Marek Jun 05 '10 at 12:24
@Marek Nice idea but there's still a difference. – gd047 Jun 05 '10 at 14:28
