After learning about the options for working with sparse matrices in R, I want to use the Matrix package to create a sparse matrix from the following data frame and have all other elements be NA.

     s    r d
1 1089 3772 1
2 1109  190 1
3 1109 2460 1
4 1109 3071 2
5 1109 3618 1
6 1109   38 7

I know I can create a sparse matrix with the following, accessing elements as usual:

> library(Matrix)
> Y <- with(X, sparseMatrix(i = s, j = r, x = d))
> Y[1089,3772]
[1] 1
> Y[1,1]
[1] 0

To make the default value NA instead, I tried the following:

  M <- Matrix(NA, max(X$s), max(X$r), sparse=TRUE)
  for (i in seq_len(nrow(X)))
    M[X$s[i], X$r[i]] <- X$d[i]

and got this error:

Error in checkSlotAssignment(object, name, value) : 
  assignment of an object of class "numeric" is not valid for slot "x" in an object of class "lgCMatrix"; is(value, "logical") is not TRUE

Not only that, but accessing elements of M takes much longer than accessing elements of Y:

> system.time(Y[3,3])
   user  system elapsed 
  0.000   0.000   0.003 
> system.time(M[3,3])
   user  system elapsed 
  0.660   0.032   0.995 

How should I be creating this matrix? Why is one matrix so much slower to work with?

Here's the code snippet for the above data:

X <- structure(list(s = c(1089, 1109, 1109, 1109, 1109, 1109), r = c(3772, 
190, 2460, 3071, 3618, 38), d = c(1, 1, 1, 2, 1, 7)), .Names = c("s", 
"r", "d"), row.names = c(NA, 6L), class = "data.frame")
Christopher DuBois

2 Answers

Why do you want default NA values? As far as I know, a matrix is only sparse if most of its cells are zero. Since NA is a non-zero value, you lose all the benefits of the sparse representation. A classic (dense) matrix is even more efficient when the matrix has hardly any zeros: it is just a data vector that is cut according to the dimensions, so it only has to store the values and the dimensions. A sparse matrix stores only the non-zero values, but it also has to store their locations. This is an advantage if and only if you have enough zero values.
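
To make the storage argument concrete, here is a minimal sketch using the data from the question. It assumes the Matrix package is installed; the name `D` for the dense NA-filled matrix is only illustrative and does not appear in the question.

```r
library(Matrix)

# The six observations from the question
X <- data.frame(s = c(1089, 1109, 1109, 1109, 1109, 1109),
                r = c(3772, 190, 2460, 3071, 3618, 38),
                d = c(1, 1, 1, 2, 1, 7))

# Sparse: only the six non-zero values and their positions are stored
Y <- sparseMatrix(i = X$s, j = X$r, x = X$d)

# Dense (illustrative name D): one double per cell, whatever its value
D <- matrix(NA_real_, nrow = max(X$s), ncol = max(X$r))
D[cbind(X$s, X$r)] <- X$d

length(Y@x)   # values actually stored in the sparse matrix
length(D)     # values actually stored in the dense matrix
print(object.size(Y), units = "Kb")
print(object.size(D), units = "Mb")
```

The dense matrix pays for every cell, while the sparse one pays only for the stored entries plus their index vectors, which is why filling the "empty" cells with NA removes the benefit.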

Thierry
  • But if my "default" value is 1 then surely you just have 1 extra bit of information to store, i.e. that the default is 1 instead of assuming 0. I still store the "different from default" values as you do in the 0 example but the premise is much more general. – adunaic Sep 30 '14 at 14:49
  • "This is an advantage if and only if you have enough zero values.": Simply not true. Replace every occurrence of "zero" in your comment by "one" or any other number and you will see that your sentence still holds. The fact that zero is used is just by convention and there are many applications where it makes sense to have default values other than zero. In terms of memory savings, it makes sense to set the default value to the number which occurs most often in your data set. – derwiwie Aug 09 '16 at 14:08

Yes, Thierry's answer is definitely correct, as I can say as co-author of the 'Matrix' package...

To your other question: why is accessing "M" slower than "Y"? The main answer is that "Y" is much, much sparser than "M" and hence much smaller, and (depending on the sizes involved and the RAM of your platform) access is faster for much smaller objects, notably when indexing into them.
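
As a rough way to check the size difference yourself, the sketch below rebuilds both objects from the question and compares how much each one stores. It assumes a reasonably recent version of the Matrix package; the exact class that `Matrix(NA, ..., sparse=TRUE)` returns may vary between versions, so the code only looks at sizes and counts.

```r
library(Matrix)

X <- data.frame(s = c(1089, 1109, 1109, 1109, 1109, 1109),
                r = c(3772, 190, 2460, 3071, 3618, 38),
                d = c(1, 1, 1, 2, 1, 7))

# Y stores only the six observed entries
Y <- sparseMatrix(i = X$s, j = X$r, x = X$d)

# M is nominally "sparse", but every cell holds NA, which is not zero,
# so every cell has to be stored explicitly
M <- Matrix(NA, max(X$s), max(X$r), sparse = TRUE)

nnzero(Y)                    # stored non-zero entries in Y
nnzero(M, na.counts = TRUE)  # entries in M when NA is counted as non-zero
print(object.size(Y), units = "Kb")
print(object.size(M), units = "Mb")
```

Every index operation on M has to work through a structure that is several orders of magnitude larger than Y, which is consistent with the timings in the question.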

Martin Mächler
  • Thanks! I look forward to seeing more of your answers on StackOverflow. I'll try and drum up some of the questions I've had while using Matrix... – Christopher DuBois Aug 24 '09 at 17:10
  • It is unfortunate that all non-zero cells are always stored. It would be nice to be able to specify a default value other than zero for a sparseMatrix. – Quantum7 May 06 '10 at 00:19
  • Is there a way to set a default value for sparseMatrix? – hs3180 Apr 26 '14 at 08:04
  • I agree with @Quantum7: In life sciences for example "0" does not always mean "no information". A pairwise similarity of 0 between two objects carries information that these are dissimilar. Whereas NA means we simply don't know how similar they are (often the case in biological data). It would make a lot of sense to not automatically equate 0 with missing and to allow the user to pass in the desired default value (e.g. NA). In terms of memory improvements it should be the one that occurs most often in the data set. In my eyes, this is a limitation of your implementation and not a general thing. – derwiwie Aug 09 '16 at 14:17
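
One possible workaround for the "default NA" use case raised in these comments is to keep two sparse objects: the values themselves and a pattern matrix recording which cells were observed. This is only a sketch, not a feature of the Matrix package. The helper `lookup_na` and the names `vals` and `obs` are hypothetical, and the extra row with a value of 0 is made up purely to show that an observed zero can be told apart from a missing cell.

```r
library(Matrix)

X <- data.frame(s = c(1089, 1109, 1109, 1109, 1109, 1109),
                r = c(3772, 190, 2460, 3071, 3618, 38),
                d = c(1, 1, 1, 2, 1, 7))

# Hypothetical extra observation whose value is a genuine 0
X <- rbind(X, data.frame(s = 5, r = 5, d = 0))

# Values: an ordinary numeric sparse matrix
vals <- sparseMatrix(i = X$s, j = X$r, x = X$d)

# Pattern matrix (no x argument): TRUE where a cell was observed
obs <- sparseMatrix(i = X$s, j = X$r)

# Hypothetical scalar accessor: NA for unobserved cells, stored value otherwise
lookup_na <- function(vals, obs, i, j) {
  if (obs[i, j]) vals[i, j] else NA_real_
}

lookup_na(vals, obs, 1109, 38)  # observed, non-zero value
lookup_na(vals, obs, 5, 5)      # observed, value is a real 0
lookup_na(vals, obs, 1, 1)      # never observed
```

Both objects stay small because neither stores the unobserved cells; the cost is that every lookup goes through the helper rather than plain indexing.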