Can anyone explain why these two correlation matrices return different results?

library(recommenderlab)
data(MovieLense)
cor_mat <- as( similarity(MovieLense, method = "pearson", which = "items"), "matrix" )
cor_mat_base <- suppressWarnings( cor(as(MovieLense, "matrix"), use = "pairwise.complete.obs") )
print( cor_mat[1:5, 1:5] )
print( cor_mat_base[1:5, 1:5] )
  • Why did you expect them to return the same results? Is `cor_mat` using complete observations too? – NelsonGon Jun 19 '19 at 14:14
  • @NelsonGon I'm not sure I understand the question. I would assume both functions use only the paired values where neither value is NA (because I don't know how a correlation could be computed otherwise). Whichever non-NA-producing argument I supply to "use = ", I get the same result. Unless I am misunderstanding? – Miles Coltrane Jun 19 '19 at 14:24
  • For the `use` part, please take a look at this: https://stackoverflow.com/questions/18892051/complete-obs-of-cor-function and this: https://stats.stackexchange.com/questions/262925/is-there-a-serious-problem-with-dropping-observations-with-missing-values-when-c – NelsonGon Jun 19 '19 at 14:28
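
To make the `use` discussion in the comments above concrete, here is a minimal base-R sketch of how the options of cor() treat missing values. The toy matrix `x` and its values are made up for illustration and have nothing to do with MovieLense; on other data the options may or may not give different numbers.

```r
# Toy data with scattered NAs (hypothetical values, for illustration only).
x <- cbind(
  a = c(1, 2, 3, 4, NA, 6),
  b = c(2, 1, 4, NA, 5, 7),
  c = c(NA, 3, 1, 2, 6, 5)
)

# Default use = "everything": a pair of columns containing an NA yields NA.
r_all <- cor(x)

# "complete.obs": drop every row with an NA in any column, then correlate.
r_complete <- cor(x, use = "complete.obs")

# "pairwise.complete.obs": for each pair of columns, keep the rows that are
# complete for that pair only, so different pairs can use different rows.
r_pairwise <- cor(x, use = "pairwise.complete.obs")

print(r_all)
print(r_complete)
print(r_pairwise)
```

In this toy case "complete.obs" uses only the three rows without any NA, while "pairwise.complete.obs" uses four rows for each pair, so the two matrices differ.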

1 Answer

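For Pearson, recommenderlab's dissimilarity() amounts to 1 - pmax(cor(), 0), where cor() is the base R function. So compare dissimilarity() against 1 - cor(), and specify the same method in both calls so that they compute the same measure: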

library("recommenderlab")
data(MovieLense)
cor_mat <- as( dissimilarity(MovieLense, method = "pearson", 
                          which = "items"), "matrix" )
cor_mat_base <- suppressWarnings( cor(as(MovieLense, "matrix"), method = "pearson",
                                      use = "pairwise.complete.obs") )
print( cor_mat[1:5, 1:5] )
print(1- cor_mat_base[1:5, 1:5] )

> print( cor_mat[1:5, 1:5] )
                  Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) Copycat (1995)
Toy Story (1995)         0.0000000        0.7782159         0.8242057         0.8968647      0.6135248
GoldenEye (1995)         0.7782159        0.0000000         0.7694644         0.7554443      0.7824406
Four Rooms (1995)        0.8242057        0.7694644         0.0000000         1.0000000      0.8153877
Get Shorty (1995)        0.8968647        0.7554443         1.0000000         0.0000000      1.0000000
Copycat (1995)           0.6135248        0.7824406         0.8153877         1.0000000      0.0000000
> print(1- cor_mat_base[1:5, 1:5] )
                  Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) Copycat (1995)
Toy Story (1995)         0.0000000        0.7782159         0.8242057         0.8968647      0.6135248
GoldenEye (1995)         0.7782159        0.0000000         0.7694644         0.7554443      0.7824406
Four Rooms (1995)        0.8242057        0.7694644         0.0000000         1.2019687      0.8153877
Get Shorty (1995)        0.8968647        0.7554443         1.2019687         0.0000000      1.2373503
Copycat (1995)           0.6135248        0.7824406         0.8153877         1.2373503      0.0000000

For the details, see the help pages of both functions (?dissimilarity in recommenderlab and ?cor in stats).

EDIT: Some entries still differ between dissimilarity() and 1 - cor(). Wherever cor() is negative, 1 - cor() is greater than 1 (for example, 1.2019687 for Four Rooms vs. Get Shorty above), whereas dissimilarity() sets a floor at 0 on the correlation and so never returns more than 1. To compare on the similarity scale, use 1 - dissimilarity() against cor(); the two then agree except where the correlation is negative. Note also that the cor() documentation (https://www.rdocumentation.org/packages/stats/versions/3.6.0/topics/cor) states a bound only for use = "all.obs":

For r <- cor(*, use = "all.obs"), it is now guaranteed that all(abs(r) <= 1).

No such guarantee is stated there for use = "pairwise.complete.obs", so this is worth checking.
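
The flooring effect can be reproduced with base R alone. The sketch below assumes, as described above, that the Pearson dissimilarity behaves like 1 - pmax(cor(), 0); it does not call recommenderlab, and the toy `ratings` matrix and variable names are invented for illustration.

```r
# Toy ratings: rows are users, columns are items; NA marks a missing rating.
ratings <- cbind(
  item_a = c(5, 4, NA, 2, 1),
  item_b = c(4, 5, 3, NA, 1),
  item_c = c(1, 2, 4, 5, NA)
)

# Pairwise-complete Pearson correlation between items (columns).
r <- cor(ratings, method = "pearson", use = "pairwise.complete.obs")

# Plain 1 - cor: exceeds 1 wherever the correlation is negative.
d_plain <- 1 - r

# Floored version (assumed to mirror dissimilarity()): negative correlations
# are set to 0 first, so the result cannot exceed 1.
d_floor <- 1 - pmax(r, 0)

print(round(d_plain, 3))
print(round(d_floor, 3))

# The two versions agree wherever the correlation is non-negative.
agree <- r >= 0
stopifnot(isTRUE(all.equal(d_plain[agree], d_floor[agree])))
```

Here item_c correlates negatively with the other two items, so those entries are above 1 in `d_plain` and exactly 1 in `d_floor`, the same pattern as in the two printed matrices above.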

Carles
  • I think in `cor`, pearson is the default method. – NelsonGon Jun 19 '19 at 14:29
  • Curious, why switch similarity with dissimilarity? Are they truly the same thing? – NelsonGon Jun 19 '19 at 14:40
  • @CarlesSansFuentes Thanks so much! Switching to dissimilarity() function and then subtracting it from 1 worked. I still don't quite understand what similarity() is doing then. The Help details say "Similarities are computed from dissimilarities using s=1/(1+d) or s=1-d depending on the measure. For Pearson we use 1 - positive correlation." So for Pearson shouldn't similarity() and 1-dissimilarity() be identical? – Miles Coltrane Jun 19 '19 at 14:54
  • Well, this is explained in the Details section and comes down to the definitions of similarity and dissimilarity. Check `?dissimilarity` in R. In the Details, it says: `Similarities are computed from dissimilarities using s=1/(1+d) or s=1-d depending on the measure. For Pearson we use 1 - positive correlation.` – Carles Jun 19 '19 at 14:58
  • @CarlesSansFuentes I just saw your EDIT, but I think it is incorrect. You are computing 1 - cor, and cor can be a negative number; that is why you are getting a number greater than 1. You should compute 1 - dissimilarity, and then you'll get similar answers (though not identical, because it seems dissimilarity() sets a floor at 0 and does not allow negative numbers). – Miles Coltrane Jun 19 '19 at 14:59
  • @MilesColtrane, can you edit my comment so that I can see explicitly what you think it should say? I am not sure what I wrote wrong. – Carles Jun 19 '19 at 15:42
  • @MilesColtrane, thank you for the edit, I have reedited it to compact it as a better answer. – Carles Jun 19 '19 at 17:27