Different results from base R cor() function than similarity() function in recommenderlab package?

Question

Can anyone explain why these two correlation matrices return different results?

library(recommenderlab)
data(MovieLense)
cor_mat <- as( similarity(MovieLense, method = "pearson", which = "items"), "matrix" )
cor_mat_base <- suppressWarnings( cor(as(MovieLense, "matrix"), use = "pairwise.complete.obs") )
print( cor_mat[1:5, 1:5] )
print( cor_mat_base[1:5, 1:5] )

Why do (or did) you expect them to return the same results? Is `cor_mat` using complete observations too? — NelsonGon, Jun 19 '19 at 14:14
@NelsonGon I'm not sure I understand the question. I would assume both functions would use only the paired values where neither value is NA (because I don't know how a correlation could be run otherwise.) Whichever non-NA-producing argument I supply to "use = " I get the same result. Unless I am misunderstanding? — Miles Coltrane, Jun 19 '19 at 14:24
For the `use` part, please take a look at this: https://stackoverflow.com/questions/18892051/complete-obs-of-cor-function and this: https://stats.stackexchange.com/questions/262925/is-there-a-serious-problem-with-dropping-observations-with-missing-values-when-c — NelsonGon, Jun 19 '19 at 14:28

Carles · Accepted Answer · 2019-06-19T17:26:12.903

With method = "pearson", recommenderlab's dissimilarity() is equivalent to 1 - pmax(cor(), 0), where cor() is the base R function. It is also important to specify the method explicitly in both calls so that they use the same one:

library("recommenderlab")
data(MovieLense)
cor_mat <- as(dissimilarity(MovieLense, method = "pearson", which = "items"),
              "matrix")
cor_mat_base <- suppressWarnings(cor(as(MovieLense, "matrix"),
                                     method = "pearson",
                                     use = "pairwise.complete.obs"))
print( cor_mat[1:5, 1:5] )
print(1- cor_mat_base[1:5, 1:5] )

> print( cor_mat[1:5, 1:5] )
                  Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) Copycat (1995)
Toy Story (1995)         0.0000000        0.7782159         0.8242057         0.8968647      0.6135248
GoldenEye (1995)         0.7782159        0.0000000         0.7694644         0.7554443      0.7824406
Four Rooms (1995)        0.8242057        0.7694644         0.0000000         1.0000000      0.8153877
Get Shorty (1995)        0.8968647        0.7554443         1.0000000         0.0000000      1.0000000
Copycat (1995)           0.6135248        0.7824406         0.8153877         1.0000000      0.0000000
> print(1- cor_mat_base[1:5, 1:5] )
                  Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995) Copycat (1995)
Toy Story (1995)         0.0000000        0.7782159         0.8242057         0.8968647      0.6135248
GoldenEye (1995)         0.7782159        0.0000000         0.7694644         0.7554443      0.7824406
Four Rooms (1995)        0.8242057        0.7694644         0.0000000         1.2019687      0.8153877
Get Shorty (1995)        0.8968647        0.7554443         1.2019687         0.0000000      1.2373503
Copycat (1995)           0.6135248        0.7824406         0.8153877         1.2373503      0.0000000

To understand this fully, check the Details sections of ?dissimilarity and ?cor :).
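The relationship can also be reproduced without recommenderlab. The following is a minimal base R sketch on a made-up ratings matrix (the data and the names `ratings`, `d_raw` and `d_clamped` are hypothetical, not part of either package); it only illustrates the formula 1 - pmax(cor(), 0) and is not recommenderlab's actual implementation:

```r
# Toy ratings matrix (hypothetical data): rows are users, columns are items,
# NA marks a missing rating.
ratings <- matrix(
  c(5,  4,  1,
    4,  5,  2,
    2,  1,  5,
    1,  NA, 4,
    NA, 2,  5),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("u", 1:5), c("A", "B", "C"))
)

# Pairwise Pearson correlation between items (columns).
r <- cor(ratings, method = "pearson", use = "pairwise.complete.obs")

d_raw     <- 1 - r           # exceeds 1 wherever r is negative
d_clamped <- 1 - pmax(r, 0)  # negative correlations floored at 0, so d <= 1

print(round(d_raw, 3))
print(round(d_clamped, 3))
```

Items A and C are rated in opposite directions, so their raw value is above 1 while the clamped value is exactly 1; for the positively correlated pair A and B the two versions agree.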

OP/EDIT: Some values still differ between dissimilarity() and 1 - cor(): in the output above, 1 - cor() exceeds 1 (e.g. 1.2019687) where dissimilarity() returns exactly 1. This happens wherever cor() is negative: dissimilarity() floors the correlation at 0 (i.e., it ignores negative correlations), so its result is capped at 1, whereas 1 - cor() is not. Separately, regarding the range of cor() itself, the documentation (https://www.rdocumentation.org/packages/stats/versions/3.6.0/topics/cor) only specifies that

For r <- cor(*, use = "all.obs"), it is now guaranteed that all(abs(r) <= 1).

Whether the same bound holds for use = "pairwise.complete.obs" should be checked.
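One way to check the range of cor() empirically is sketched below. It uses simulated ratings rather than MovieLense (the data, the seed and the names `sim` and `off_diag` are assumptions made for illustration), so it is an empirical spot check, not a proof:

```r
set.seed(42)  # arbitrary seed, only for reproducibility

# Simulated ratings (hypothetical data): 200 users x 10 items, values 1-5,
# with roughly 60% of the entries set to missing.
n_users <- 200
n_items <- 10
sim <- matrix(sample(1:5, n_users * n_items, replace = TRUE),
              nrow = n_users, ncol = n_items)
sim[sample(length(sim), size = round(0.6 * length(sim)))] <- NA

r <- suppressWarnings(cor(sim, use = "pairwise.complete.obs"))

# Range of the off-diagonal correlations, ignoring any NA entries.
off_diag <- r[upper.tri(r)]
print(range(off_diag, na.rm = TRUE))

# TRUE if every computed correlation lies in [-1, 1] (up to rounding error).
print(all(abs(off_diag) <= 1 + 1e-12, na.rm = TRUE))
```

The same check can be run on cor_mat_base from the answer by replacing `r` with that matrix.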

Curious, why switch similarity with dissimilarity? Are they truly the same thing? — NelsonGon, Jun 19 '19 at 14:40
@CarlesSansFuentes Thanks so much! Switching to dissimilarity() function and then subtracting it from 1 worked. I still don't quite understand what similarity() is doing then. The Help details say "Similarities are computed from dissimilarities using s=1/(1+d) or s=1-d depending on the measure. For Pearson we use 1 - positive correlation." So for Pearson shouldn't similarity() and 1-dissimilarity() be identical? — Miles Coltrane, Jun 19 '19 at 14:54
Well, this is explained in the Details section and comes down to the definitions of similarity and dissimilarity. Check `?dissimilarity` in R. The Details say: `Similarities are computed from dissimilarities using s=1/(1+d) or s=1-d depending on the measure. For Pearson we use 1 - positive correlation.` — Carles, Jun 19 '19 at 14:58
@CarlesSansFuentes I just saw your EDIT, but I think it is incorrect. You are subtracting 1 - cor and cor can be a negative number, that is why you are getting a number greater than 1. You should subtract 1 - dissimilarity and then you'll get similar answers (though not identical because it seems dissimilarity() sets a floor at 0 and does not allow negative numbers.) — Miles Coltrane, Jun 19 '19 at 14:59
@MilesColtrane, can you edit my comment such that I can see what you are explicitly saying how it should be correct? I am not sure about what I wrote wrong. — Carles, Jun 19 '19 at 15:42
@MilesColtrane, thank you for the edit, I have reedited it to compact it as a better answer. — Carles, Jun 19 '19 at 17:27
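
The two conversions quoted from ?dissimilarity in the comments above, s=1/(1+d) and s=1-d, can be compared on toy values. This sketch only illustrates the two formulas (the values in `d` are made up); which conversion a given recommenderlab version applies for Pearson is not verified here:

```r
# Toy dissimilarities (hypothetical values).
d <- c(0, 0.25, 0.5, 1)

s_inverse  <- 1 / (1 + d)  # s = 1/(1+d)
s_subtract <- 1 - d        # s = 1 - d

# The two conversions agree only at d = 0 and diverge as d grows.
print(data.frame(d, s_inverse, s_subtract))
```

Because the two formulas give different numbers for the same d, a similarity matrix matches 1 - dissimilarity only if the s = 1 - d conversion is the one being used.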
