I am facing a problem I do not understand. It is a follow-up on the answers suggested here and here.
I have two identically structured datasets: one I created as a reproducible example, for which the code works, and one real dataset, for which the code does not work. After staring at it for hours I cannot find what is causing the error. The following is an example that works:
library(data.table)
# Name the columns at construction, which avoids the colnames<- warning
df <- data.table(a = rep(seq(1, 25), each = 4), b = rep(seq(1, 40), length.out = 100))
setkey(df, a, b)
This is just to create a reproducible example. When I apply the (slightly adjusted) code suggested in the mentioned SO answers, I get what I am looking for: a sparse matrix that indicates how often two elements of column b occur together under the same value of column a.
library(Matrix)
s <- sparseMatrix(
  df$a,
  df$b,
  dimnames = list(unique(df$a), unique(df$b)),
  x = 1
)
v <- t(s) %*% s
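To make the intended result concrete, here is a minimal sketch on a toy input. The values and the names `a_toy`, `b_toy`, `s_toy` and `v_toy` are made up for illustration and are not part of my data.

```r
library(Matrix)

# Toy data (made up): group 1 contains items 1 and 2, group 2 contains items 2 and 3
a_toy <- c(1, 1, 2, 2)
b_toy <- c(1, 2, 2, 3)

# 2 x 3 incidence matrix: rows are values of a, columns are values of b
s_toy <- sparseMatrix(a_toy, b_toy, x = 1)

# 3 x 3 matrix: entry [i, j] counts the groups that contain both item i and item j
v_toy <- t(s_toy) %*% s_toy

as.matrix(v_toy)
```

Items 1 and 2 share one group, as do items 2 and 3, while items 1 and 3 never occur together, so the off-diagonal entries should be 1, 1 and 0 respectively.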
Now I am doing what looks to me like exactly the same thing on my real dataset, which is much longer.
A sample dput of the first 50 rows looks like this:
test <- dput(dk[1:50,])
structure(list(pid = c(204L, 204L, 207L, 254L, 254L, 258L, 258L,
258L, 258L, 258L, 265L, 265L, 269L, 269L, 269L, 269L, 1520L,
1520L, 1520L, 1520L, 1532L, 1532L, 1534L, 1534L, 1534L, 1534L,
1539L, 1539L, 1543L, 1543L, 1546L, 1546L, 1546L, 1546L, 1546L,
1546L, 1546L, 1549L, 1549L, 1549L, 1559L, 1559L, 1559L, 1559L,
1559L, 1559L, 1559L, 1561L, 1561L, 1561L), cid = c(11023L, 11787L,
14232L, 14470L, 14480L, 1290L, 1637L, 4452L, 13964L, 14590L,
17814L, 23453L, 6658L, 10952L, 17259L, 27549L, 11034L, 22748L,
23345L, 23347L, 10487L, 11162L, 15570L, 15629L, 17983L, 17999L,
17531L, 22497L, 14425L, 14521L, 11495L, 24948L, 24962L, 24969L,
24972L, 24973L, 30627L, 17886L, 18428L, 23972L, 13890L, 13936L,
14432L, 21230L, 21271L, 21384L, 21437L, 341L, 354L, 6302L)), .Names = c("pid",
"cid"), sorted = c("pid", "cid"), class = c("data.table", "data.frame"
), row.names = c(NA, -50L), .internal.selfref = <pointer: 0x0000000000100788>)
Then, when running the same code, I get an error:
s <- sparseMatrix(test$pid,test$cid,dimnames = list(unique(test$pid), unique(test$cid)),x = 1)
The error (which occurs in the test dataset as well) reads as follows:
Error in validObject(r) :
invalid class “dgTMatrix” object: length(Dimnames[[1]])' must match Dim[1]
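Since the message compares the number of row labels with the number of rows, one check that may help narrow this down is to print both side by side. This is only a diagnostic sketch, not a fix. It uses the first four rows of the dput above, and it assumes (from my reading of the documentation) that sparseMatrix defaults its dimensions to the largest row and column index when dims is not supplied. The names `pid4`, `cid4` and `s0` are introduced here for illustration.

```r
library(Matrix)

# First four rows of the sample above
pid4 <- c(204L, 204L, 207L, 254L)
cid4 <- c(11023L, 11787L, 14232L, 14470L)

# Without dimnames the call succeeds, so the dimensions it chose can be inspected
s0 <- sparseMatrix(pid4, cid4, x = 1)

# Dimensions chosen by sparseMatrix
dim(s0)

# Number of labels that would be passed as dimnames
c(length(unique(pid4)), length(unique(cid4)))
```

If the two printed vectors differ, the labels cannot line up with the rows and columns, which would be consistent with the error above.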
The problem disappears when I remove the dimnames, but I really need them to make sense of the results. I'm sure I'm missing something obvious. Can someone please tell me what it is?