I'm using glmnet and glmnetcr to fit ordinal regression models.
Unfortunately, my model matrix is ~640000 x 5000, about 3.2e9 elements. That is more than a 32-bit integer can index (the limit is 2^31 - 1), so I'm running into the same problem others have described: R vector size limit: "long vectors (argument 5) are not supported in .C"
If I only use half of the data, I can run this on my local server with plenty of memory and have no problems.
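To make the size problem concrete, here is a small sketch of the arithmetic, using the approximate dimensions quoted above (the variable names are only for illustration):

```r
# Number of cells in a 640000 x 5000 model matrix. Numeric literals in R are
# doubles, so the multiplication itself does not overflow.
n_cells <- 640000 * 5000          # 3.2e9
int_max <- .Machine$integer.max   # 2147483647, i.e. 2^31 - 1

# TRUE: the full matrix needs a long vector
exceeds_limit <- n_cells > int_max

# TRUE: half of the rows stay under the limit, which fits what I see in practice
half_fits <- (n_cells / 2) <= int_max
```
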
I've attempted to implement the 'solution' in the above post by using the dotCall64 package. I've replaced the .Fortran calls with .C64 and specified the data type for each variable. However, each time I run my code I either get nonsensical lambda values (9.9e35) or segfaults such as:
*** caught segfault ***
address 0x1511aaeb0, cause 'memory not mapped'
Which of the two I get, and the exact address, vary from run to run, so I assume I'm doing something wrong in implementing this solution.
Here is the code so far in the function lognet() (the function that glmnet and glmnetcr ultimately call, and which passes the variables to the Fortran code):
Original code in lognet()
.Fortran("lognet", parm = alpha, nobs, nvars, nc, as.double(x),
y, offset, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh,
isd, intr, maxit, kopt, lmu = integer(1), a0 = double(nlam *
nc), ca = double(nx * nlam * nc), ia = integer(nx),
nin = integer(nlam), nulldev = double(1), dev = double(nlam),
alm = double(nlam), nlp = integer(1), jerr = integer(1),
PACKAGE = "glmnet")
Modified code in lognet()
.C64("lognet", SIGNATURE = c("double","int", "int", "int", "int64",
"double","double","int", "double","double",
"int", "int", "int", "double","double",
"double","int", "int", "int", "int",
"int", "double","double","int", "int",
"double","double","double","int", "int"),
parm = alpha, nobs, nvars, nc, as.double(x),
y, offset, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh,
isd, intr, maxit, kopt, lmu = integer(1), a0 = double(nlam * nc), ca = double(nx * nlam * nc), ia = integer(nx),
nin = integer(nlam), nulldev = double(1), dev = double(nlam),
alm = double(nlam), nlp = integer(1), jerr = integer(1),
PACKAGE = "glmnet")
Toy example (data much smaller than actual)
library(glmnetcr)
library(dotCall64)
x1 <- cbind(c(0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1),
            c(0,0,0,1,0,1,1,1,0,0,0,0,0,1,1,1),
            c(0,0,1,0,1,0,1,1,0,0,0,0,1,0,1,1),
            c(0,1,0,0,1,1,0,1,0,0,0,0,1,1,0,1),
            c(0,0,0,0,0,0,1,1,0,0,0,0,0,0,1,1),
            c(0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,1),
            c(0,0,0,0,1,0,0,1,0,0,0,0,1,0,0,1))
y1 <- c(0,0,0,1,1,1,2,2,0,1,0,1,1,2,1,2)
testA <- glmnetcr(x = x1, y = y1, method = "forward", nlambda = 10,
                  lambda.min.ratio = 0.001, alpha = 1, maxit = 500,
                  standardize = FALSE)
Running this with the original lognet() code produces no problems. Running it with the modified lognet() code produces odd lambda estimates and/or segfaults (which of the two happens seems to be random). My first guess is that I have typed one of the variables incorrectly, but I've gone through everything twice and can't see the problem. The other possibility is that the underlying Fortran code can't handle 64-bit integers. I know no Fortran at all and am not sure how to even begin fixing the problem if that is the case.
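One way to check the "typed incorrectly" guess mechanically is to compare each declared signature entry with the R storage type of the matching argument. The sketch below uses base R only. check_signature() is a hypothetical helper, not part of dotCall64, and the toy arguments are stand-ins for the first five lognet arguments. It assumes that dotCall64 coerces each argument to its declared type, so a double array declared as "int64" would reach Fortran as integer data.

```r
# Hypothetical helper: line up the declared signature with typeof() of each argument.
check_signature <- function(signature, args) {
  stopifnot(length(signature) == length(args))
  actual <- vapply(args, typeof, character(1))
  # Treat "int" and "integer" as the same declaration.
  declared <- ifelse(signature %in% c("int", "integer"), "integer", signature)
  data.frame(position = seq_along(args),
             declared = declared,
             actual = actual,
             mismatch = declared != actual,
             stringsAsFactors = FALSE)
}

# Stand-ins for the first five lognet arguments: parm, nobs, nvars, nc, x.
toy_args <- list(parm = 1, nobs = 16L, nvars = 7L, nc = 3L,
                 x = as.double(matrix(0, 16, 7)))
toy_sig <- c("double", "int", "int", "int", "int64")

res <- check_signature(toy_sig, toy_args)
res[res$mismatch, ]   # rows where the declaration and the R type disagree
```

With these stand-ins, the only row flagged is position 5, where x is a double vector but is declared as "int64". A flagged "int64" entry is only a problem if the Fortran routine expects a real array at that position, so the result still has to be checked against the Fortran source.
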