R merged loop performance

Question

I have 2000 rows of data with 4000 columns. I want to compare each row with every other row and measure how similar they are, as the number of differing columns divided by the total number of columns.

What I did so far is as follows:

for (i in 1:nrow(data))
{
    for (j in (i+1):nrow(data))
    { 
        mycount[[i,j]] = length(which(data[i,] != data[j,]))
    }
}

~~There are 2 problems with it, j doesn't start from i+1 (which is probably a basic mistake)~~ The main problem, however, is the time it takes: it runs for ages...

Could someone please suggest a better way to achieve the same result, the result being the percentage similarity of each row to every other row?

Here's an example of data and what I want to achieve: screenshot of the image

The output should be something like:

mycount[1,2] = 2 (S# and var3 columns are different)
mycount[1,3] = 2 (S# and var1 columns are different)
mycount[1,4] = 2 (S# and var4 columns are different)
mycount[2,3] = ...
mycount[2,4] = ...
mycount[3,4] = 3 (S#, var1 and var4 columns are different)

So you want to compare each row to the row directly beneath it to see if it's identical? — sebastian-c, Nov 22 '16 at 10:06
The not starting from i+1 is because of `i+1:nrow(data)`: R reads this as `i + 1:nrow(data)`, so you need to add parentheses: `(i+1):nrow(data)`. — Marijn Stevering, Nov 22 '16 at 10:06
Could you please add a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)? — fridaymeetssunday, Nov 22 '16 at 10:08
@sebastian-c; not just with the one beneath, but with all rows remaining — Bahadir Ozkurt, Nov 22 '16 at 12:50
@fridaymeetssunday I hope the revised version makes it clearer — Bahadir Ozkurt, Nov 22 '16 at 12:51

konvas · Answer 1 · 2016-11-22T11:07:18.767

One problem in your code is that the value of mycount[[i]] is overwritten in each iteration of the j loop, so what you end up with is mycount[[i]] being equal to length(which(data[i,] != data[nrow(data),])). Another issue is that i+1:nrow(data) does not produce the numbers i+1, i+2, ..., nrow(data) but i + (1:nrow(data)), because : binds more tightly than +. So what you want is either (i + 1):nrow(data) or seq(i + 1, nrow(data)).
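As a small illustration of the precedence issue, here is a sketch with made-up values for i and n (they are not taken from the question). It also shows one edge case worth keeping in mind: when i equals n, both (i + 1):n and seq(i + 1, n) count downwards instead of returning an empty sequence, so the outer loop is probably safest stopping at n - 1.

```r
i <- 3
n <- 5   # stands in for nrow(data); illustrative value only

wrong  <- i + 1:n        # parsed as i + (1:n): 4 5 6 7 8
right  <- (i + 1):n      # 4 5
right2 <- seq(i + 1, n)  # 4 5

# Edge case: at the last row the sequence runs backwards rather than being empty
last <- (n + 1):n        # 6 5

# seq_len() gives an empty index instead, so a loop over it is simply skipped
safe <- n + seq_len(n - n)   # length 0
```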

You can try the following code, which will be faster than the double loop (though probably still too slow):

rows <- lapply(seq(nrow(data)), function(i) data[i, ])
outer(X = rows, Y = rows, FUN = Vectorize(function(x, y) sum(x == y)))
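Note that the snippet above counts matching columns (x == y), whereas the question counts differing ones. If it is still too slow, one possible alternative is to keep a single loop over i and compare row i against all later rows in one vectorised step. The following is only a sketch: the function name count_row_differences is hypothetical, and it assumes all columns share one type (so as.matrix does not silently convert values) and that there are no missing values.

```r
# Hypothetical helper (not from the original post): for every pair of rows,
# count the number of columns in which the two rows differ.
count_row_differences <- function(data) {
  m <- t(as.matrix(data))        # columns of m are the rows of data
  n <- ncol(m)
  mycount <- matrix(0, n, n)     # diagonal stays 0: a row never differs from itself
  for (i in seq_len(n - 1)) {
    j <- (i + 1):n
    # m[, i] is recycled down each column of m[, j], so row i of data is
    # compared with every later row at once
    d <- colSums(m[, j, drop = FALSE] != m[, i])
    mycount[i, j] <- d
    mycount[j, i] <- d           # fill the symmetric half as well
  }
  mycount
}

# Toy data, made up for illustration
data <- data.frame(var1 = c(1, 1, 2), var2 = c(5, 5, 5), var3 = c(7, 8, 7))
mycount <- count_row_differences(data)
# rows 1 and 2 differ in 1 column, rows 1 and 3 in 1, rows 2 and 3 in 2

# Share of identical columns for each pair of rows
similarity <- 1 - mycount / ncol(data)
```

This still does the same number of element comparisons overall, but it replaces the inner R-level loop with matrix operations, which is usually where most of the time goes.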
