I am trying to compare the first row of a matrix with every row of the same matrix, but the vectorized comparison is not returning the results I expect. Why might this be happening?

m <- matrix(c(1,2,3,1,2,4), nrow=2, ncol=3, byrow=TRUE)

> m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    4

> # Why does the first row not have 3 TRUE values?
> m[1,] == m
      [,1]  [,2]  [,3]
[1,]  TRUE FALSE FALSE
[2,] FALSE FALSE FALSE

> m[1,] == m[1,]
[1] TRUE TRUE TRUE

> m[1,] == m[2,]
[1]  TRUE  TRUE FALSE

Follow-up. My actual data has a large number of rows (at least 10 million), so both time and memory add up. Are there any additional suggestions beyond the approaches others have suggested below?

m <- matrix(rep(c(1,2,3), 1000000), ncol=3, byrow=TRUE)

> #by @alexis_laz
> m1 <- matrix(m[1,], nrow = nrow(m), ncol = ncol(m), byrow = TRUE)
> system.time(m == m1)
   user  system elapsed 
   0.21    0.03    0.31

> object.size(m1)
24000112 bytes

> #by @PaulHiemstra
> system.time( t(apply(m, 1, function(x) x == m[1,])) )
   user  system elapsed 
  35.18    0.08   36.04 

Follow-up 2. @alexis_laz, you are correct. I want to compare every row with every other row, and I have posted a follow-up question on that (How to vectorize comparing each row of matrix with all other rows).

user3147662
  • You can, also, turn `m[1,]` to a matrix and do the comparisons: `m == matrix(m[1,], nrow = nrow(m), ncol = ncol(m), byrow = T)` – alexis_laz Dec 30 '13 at 22:32
  • Following your follow-up, what is your final goal? E.g. you want to test every row against the rest of the matrix for equality? You want to test the first row of many matrices against the rest of each matrix? Or you just want to test the first row of a single matrix against the rest of it as fast as possible? – alexis_laz Dec 31 '13 at 15:05

2 Answers

In the comparison m[1,] == m, the shorter operand m[1,] is recycled (once) to match the length of m. The comparison is then done element by element in column-major order, because that is how R stores a matrix.

You're comparing c(1,2,3) with c(1,1,2,2,3,4), that is, c(1,2,3,1,2,3) with c(1,1,2,2,3,4) after recycling, so you get one TRUE followed by five FALSE (packaged as a matrix to match the dimensions of m).
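
To make the recycling visible, here is a small sketch that spells out the two flat vectors by hand, using the 2x3 matrix from the question. The names flat, recycled and manual are only illustrative; they are not part of the original post.

```r
m <- matrix(c(1, 2, 3, 1, 2, 4), nrow = 2, ncol = 3, byrow = TRUE)

# A matrix is stored column by column, so this is the order `==` works in.
flat <- as.vector(m)                     # 1 1 2 2 3 4

# The shorter operand is repeated until it is as long as m.
recycled <- rep_len(m[1, ], length(m))   # 1 2 3 1 2 3

# Element-by-element comparison of the two flat vectors.
manual <- recycled == flat

# Refilling the result column by column reproduces `m[1, ] == m`.
identical(matrix(manual, nrow = nrow(m)), m[1, ] == m)
```

Only the first pair (1 and 1) matches, which is why a single TRUE ends up in position [1,1].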

Matthew Lundberg

As @MatthewLundberg pointed out, the recycling rules of R do not behave as you expected. In my opinion it is always better to state explicitly what to compare than to rely on R's assumptions. One way to make the correct comparison:

t(apply(m, 1, function(x) x == m[1,]))
     [,1] [,2]  [,3]
[1,] TRUE TRUE  TRUE
[2,] TRUE TRUE FALSE

or:

m == rbind(m[1,], m[1,])
     [,1] [,2]  [,3]
[1,] TRUE TRUE  TRUE
[2,] TRUE TRUE FALSE

or by making R's recycling work in your favor (thanks to @Arun):

t(t(m) == m[1,])
     [,1] [,2]  [,3]
[1,] TRUE TRUE  TRUE
[2,] TRUE TRUE FALSE
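
Another possible variant, offered as a sketch rather than a benchmarked recommendation: repeat each element of the first row nrow(m) times, so that the vector already lines up with the column-major layout of m and no transpose is needed. The names first_row, cmp and row_matches are made up for this example, and the rowSums step is an optional extra that assumes you want one result per row.

```r
m <- matrix(c(1, 2, 3, 1, 2, 4), nrow = 2, ncol = 3, byrow = TRUE)

# Repeat each element of row 1 nrow(m) times: 1 1 2 2 3 3.
# That order matches how m is stored (column by column).
first_row <- rep(m[1, ], each = nrow(m))

# Comparing the matrix with a vector of equal length keeps the dimensions of m.
cmp <- m == first_row

# Optional: one logical per row, TRUE when the whole row equals row 1.
row_matches <- rowSums(cmp) == ncol(m)
```

This still allocates a temporary vector as long as m, so for very large matrices the memory cost is probably similar to building a full comparison matrix.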
Paul Hiemstra
  • The issue, more than recycling, is due to column-wise comparisons, no? – Arun Dec 30 '13 at 22:23
  • @Arun You are right, as Matthew already answered. I simply provided a solution, two of which do not rely on R's recycling rules. – Paul Hiemstra Dec 30 '13 at 22:25
  • +1, nice that you also offer solutions. For the 2nd case, one could also do: `m == m[c(1L,1L), ]`. – Arun Dec 30 '13 at 22:27
  • I actually have 1e+06 rows in my matrix, so I was aiming for vectorization. Any additional suggestions? With `m <- matrix(rep(c(1,2,3), 1000000), ncol=3, byrow=TRUE)`, `system.time( t(apply(m, 1, function(x) x == m[1,])) )` reports user 37.07, system 1.06, elapsed 40.63. – user3147662 Dec 30 '13 at 23:10
  • Try @Arun's solution; it should be faster, at the expense of RAM usage. – Paul Hiemstra Dec 31 '13 at 06:54