
I am facing a problem I do not understand. It is a follow-up to the answers suggested here and here.

I have two identically structured datasets: one I created as a reproducible example, for which the code works, and a real one, for which it does not. After staring at it for hours I cannot find what is causing the error. The following example works:

    library(data.table)
    # 25 groups in column a (4 rows each); column b cycles through 1..40
    df <- data.table(a = rep(1:25, each = 4), b = rep(1:40, length.out = 100))
    setkey(df, a, b)

This is just to create a reproducible example. When I apply the - slightly adjusted - code suggested in the mentioned SO articles, I get what I am looking for: a sparse matrix that indicates how often two elements of column b occur together under the same value of column a.

library(Matrix)
s <- sparseMatrix(
  df$a,
  df$b,
  dimnames = list(unique(df$a), unique(df$b)),
  x = 1
)
v <- t(s) %*% s
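
To make the meaning of the cross-product concrete, here is a minimal sketch on hypothetical toy vectors (`grp` and `item` are illustrative names, not part of the data above). The diagonal of the result counts how many groups contain each item, and the off-diagonal entries count how many groups contain both items.

```r
library(Matrix)

# Hypothetical toy data: group 1 holds items 1 and 2, group 2 holds items 2 and 3.
grp  <- c(1, 1, 2, 2)
item <- c(1, 2, 2, 3)

# Rows are groups, columns are items; x = 1 marks membership.
m <- sparseMatrix(i = grp, j = item, x = 1)

# crossprod(m) computes t(m) %*% m.
co <- crossprod(m)
as.matrix(co)
```

Items 1 and 3 never share a group, so that entry is zero, while item 2 appears in both groups and gets a 2 on the diagonal.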

Now I am doing - in my eyes - exactly the same on my real dataset, which is much longer.

A sample of it, produced with dput, looks like this:

test <- dput(dk[1:50,])
structure(list(pid = c(204L, 204L, 207L, 254L, 254L, 258L, 258L, 
258L, 258L, 258L, 265L, 265L, 269L, 269L, 269L, 269L, 1520L, 
1520L, 1520L, 1520L, 1532L, 1532L, 1534L, 1534L, 1534L, 1534L, 
1539L, 1539L, 1543L, 1543L, 1546L, 1546L, 1546L, 1546L, 1546L, 
1546L, 1546L, 1549L, 1549L, 1549L, 1559L, 1559L, 1559L, 1559L, 
1559L, 1559L, 1559L, 1561L, 1561L, 1561L), cid = c(11023L, 11787L, 
14232L, 14470L, 14480L, 1290L, 1637L, 4452L, 13964L, 14590L, 
17814L, 23453L, 6658L, 10952L, 17259L, 27549L, 11034L, 22748L, 
23345L, 23347L, 10487L, 11162L, 15570L, 15629L, 17983L, 17999L, 
17531L, 22497L, 14425L, 14521L, 11495L, 24948L, 24962L, 24969L, 
24972L, 24973L, 30627L, 17886L, 18428L, 23972L, 13890L, 13936L, 
14432L, 21230L, 21271L, 21384L, 21437L, 341L, 354L, 6302L)), .Names = c("pid", 
"cid"), sorted = c("pid", "cid"), class = c("data.table", "data.frame"
), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000000100788>)

When I run the same call on it, I get an error:

s <- sparseMatrix(test$pid,test$cid,dimnames = list(unique(test$pid), unique(test$cid)),x = 1)

The Error (which occurs in the test dataset as well) reads as follows:

Error in validObject(r) : 
  invalid class “dgTMatrix” object: length(Dimnames[[1]])' must match Dim[1]

The problem disappears when I remove the dimnames, but I really need them to make sense of the results. I'm sure I'm missing something obvious. Can someone please tell me what it is?

SJDS

1 Answer

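When no dims are given, sparseMatrix sizes the matrix from the largest row and column index, so the raw IDs produce a matrix far larger than the supplied dimnames. We can convert the 'pid' and 'cid' columns to factors and coerce them back to numeric, or use match against the unique values of each column, to get consecutive row/column indices. With those indices, creating the sparseMatrix should work.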

test1 <- test[, lapply(.SD, function(x) 
                 as.numeric(factor(x, levels=unique(x))))]

Or we can use match:

test1 <- test[, lapply(.SD, function(x) match(x, unique(x)))]

s1 <- sparseMatrix(test1$pid,test1$cid,dimnames = list(unique(test$pid), 
                 unique(test$cid)),x = 1)
dim(s1)
#[1] 15 50

s1[1:3, 1:3]
#3 x 3 sparse Matrix of class "dgCMatrix"
#    11023 11787 14232
#204     1     1     .
#207     .     .     1
#254     .     .     .

head(test)
#   pid   cid
#1: 204 11023
#2: 204 11787
#3: 207 14232
#4: 254 14470
#5: 254 14480
#6: 258  1290
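
As a rough check of why the reproducible example in the question worked while the real data failed, the sketch below compares the inferred dimensions with the number of unique IDs. The short `pid` and `cid` vectors here are hypothetical stand-ins with gaps, not the real data.

```r
library(Matrix)

# Hypothetical IDs with gaps, similar in spirit to 'pid' and 'cid'.
pid <- c(204, 204, 207)
cid <- c(11, 17, 11)

# Without dims, sparseMatrix() sizes the matrix from the largest index.
d_raw <- dim(sparseMatrix(i = pid, j = cid, x = 1))
d_raw                                        # 207 rows, 17 columns
c(length(unique(pid)), length(unique(cid)))  # only 2 and 2 names available

# Gap-free IDs starting at 1 give max == number of unique values,
# so dimnames built from unique() happen to have the right length.
a <- rep(1:25, each = 4)
b <- rep(1:40, length.out = 100)
gap_free <- c(max(a) == length(unique(a)), max(b) == length(unique(b)))
gap_free
```

So the requirement is not about ordering: the dimnames only have to be as long as the matrix dimensions, and with gap-free IDs that holds by coincidence.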

EDIT:

If we want to keep the full row/column indices given in 'test', the dimnames must be as long as the maximum of 'pid' and of 'cid', respectively:

rnm <- seq(max(test$pid))
cnm <- seq(max(test$cid))
s2 <- sparseMatrix(test$pid, test$cid, dimnames=list(rnm, cnm))
dim(s2)
#[1]  1561 30627
s2[1:3, 1:3]
#3 x 3 sparse Matrix of class "ngCMatrix"
# 1 2 3
#1 . . .
#2 . . .
#3 . . .
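
For the co-occurrence matrix asked about in the question, the compact (match-based) indices are probably the more practical choice, since the cross-product then has one row and column per distinct 'cid' rather than per possible index. A minimal sketch, assuming a small hypothetical set of (pid, cid) pairs rather than the real 'test' data, and passing dims explicitly so the dimnames are guaranteed to fit:

```r
library(Matrix)

# Hypothetical (pid, cid) pairs shaped like 'test'; values are illustrative only.
pid <- c(204L, 204L, 207L, 254L, 254L)
cid <- c(11023L, 11787L, 11023L, 14470L, 11787L)

upid <- unique(pid)
ucid <- unique(cid)

# Compact indices via match(); the original IDs survive as dimnames.
s <- sparseMatrix(
  i = match(pid, upid),
  j = match(cid, ucid),
  x = 1,
  dims = c(length(upid), length(ucid)),
  dimnames = list(as.character(upid), as.character(ucid))
)

# Co-occurrence of cids across pids, equivalent to t(s) %*% s.
v <- crossprod(s)
v["11023", "11787"]
```

Because the dimnames carry the original IDs, entries can be looked up by name, as in the last line.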
akrun
  • Thanks @Akrun, this seems to work but I'm still puzzled. Is the reason this workaround is required because sparseMatrix requires variables that follow one another perfectly (1,2,3,4,5,6,7 ... as assigned factors do) and it does not work with less orderly numbers such as the ones I have in my 'dput' ? If so, is there any reason for this limitation ? – SJDS Sep 02 '15 at 13:07
  • @simon_icl The dimensions are not matching with the lengths of the dimnames. For example `dim(sparseMatrix(i=test$pid[1:5], j= test$cid[1:5], x=1)) #[1] 254 14480`, while we provide in the dimnames the ` length(unique(test$pid[1:5])) #[1] 3` It has to match. Another way would be to create dimnames as sequence of min:max of unique values of pid and for cid. – akrun Sep 02 '15 at 13:16
  • @simon_icl Updated the post. Can you check if that works? – akrun Sep 02 '15 at 13:25
  • I see this problem with the dimensions occurs but I don't understand why it is different between the two given examples. The only obvious difference seems to be in the order of the unique values for either dimension, which is from low to high in the first example and not so in the second example. Your solution works so I can move on but I'm trying to understand why sparseMatrix requires this order. Simply exchanging my solution with `list(ordered(unique(dk$pid)), ordered(unique(dk$cid)))` does not work... Also while I can replicate your result for the dimensions, how can they be so big ? – SJDS Sep 02 '15 at 13:45
  • @simon_icl It is big because the dims are calculated from the max values in pid and cid. So, if you create one with, for example, `sparseMatrix(i=c(215, 225), j= c(3,4), x=1)`, the dimension will be `225 x 4` instead of `2 x 2` – akrun Sep 02 '15 at 13:49
  • aha that is suddenly very clear! Thanks for your patience and detailed explanation man! – SJDS Sep 02 '15 at 14:37
  • @simon_icl No problem. Glad to help you. – akrun Sep 02 '15 at 14:38