
I have a very large matrix I'm trying to run through glmnet on a server with plenty of memory. It works fine even on very large data sets up to a certain point, after which I get the following error:

Error in elnet(x, ...) : long vectors (argument 5) are not supported in .C

If I understand correctly this is caused by a limitation in R which cannot have any vector with length longer than INT_MAX. Is that correct? Are there any available solutions to this that don't require a complete rewrite of glmnet? Do any of the alternative R interpreters (Riposte, etc) address this limitation?
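
For reference, a quick way to check whether the flattened matrix crosses that limit (just a sketch; `x` stands in for the model matrix):

```r
## A vector longer than .Machine$integer.max (== INT_MAX) is a "long vector".
.Machine$integer.max                                  # 2147483647, i.e. 2^31 - 1
as.double(nrow(x)) * ncol(x) > .Machine$integer.max   # TRUE means the matrix, viewed as a
                                                      # flat vector, exceeds the limit
```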

Thanks!

Dmitriy Selivanov
Danny
  • In your code, do you perform any subsetting of the matrix? I might be wrong, but you cannot perform matrix subsetting if the matrix has more than 36 billion elements. In that case you have to subset the matrix as if it were a huge atomic vector (which in fact it is, since a matrix is just a vector with a dimension attribute). – SabDeM Oct 18 '16 at 17:11
  • Throughout my code I am using a file-backed big.matrix to avoid these problems, but when I run glmnet I have to pass it as an R matrix like this: `theMatrix[,]`. – Danny Oct 18 '16 at 22:04
  • Hi Danny. My comment is not directly related to the question, but maybe it will help. Take a look at the pirls package by Michael Kane - https://github.com/kaneplusplus/pirls. Maybe this solver works with long vectors. – Dmitriy Selivanov Oct 22 '16 at 11:50
  • The problem really is the underlying design of glmnet and its use of the (effectively deprecated and discouraged) `.C()` interface. Mike Kane had a good hard look at this, and pirls should indeed offer something. It is of course smaller/younger/less well tested, so YMMV. – Dirk Eddelbuettel Oct 23 '16 at 00:57
  • Just discovered another very promising package - https://github.com/jaredhuling/oem – Dmitriy Selivanov Oct 25 '16 at 07:51

2 Answers


Since version 3, R supports long vectors. A long vector is indexed by doubles. A long vector can be the underlying data of a matrix or a more-than-two-dimensional array, as long as each dimension is small enough to be indexed by an integer. Long vectors cannot, however, be passed to native code via .C or .Fortran. The error message you are getting appears because a long vector is being passed via .C.
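
A minimal sketch of these points (note: the first line allocates roughly 24 GB, so only run it on a machine with plenty of memory):

```r
x <- double(3e9)          # length > .Machine$integer.max, so x is a long vector
is.integer(length(x))     # FALSE: the length is reported as a double
dim(x) <- c(3e6, 1e3)     # fine, since each dimension still fits in an integer
x[2.5e9]                  # elements are addressed with double indices
## Passing x to compiled code via .C()/.Fortran() stops with the
## "long vectors ... are not supported in .C" error from the question.
```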

Long vectors can be passed via .Call. So, as long as the native code of glmnet either already supports long vectors (64-bit indexes) or can be modified/compiled to support them, one would only have to modify the interface between R and glmnet's native code. You can do this manually in C, and there is also a new package named dotCall64 for exactly this task. Part of modifying the interface is deciding when to copy arguments - .C/.Fortran copy preemptively, but you don't want to do that unnecessarily with large data structures.
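
To give a flavour of what the dotCall64 route looks like (the routine and package names below are hypothetical, not glmnet's actual Fortran entry points), you replace a `.Fortran()` call with `dotCall64::.C64()` and declare a type signature, using `"int64"` for arguments the 64-bit Fortran code expects as 8-byte integers:

```r
library(dotCall64)

## Hypothetical Fortran subroutine big_sum(n, x, res), compiled with 64-bit
## integers, where n may exceed 2^31 - 1. The original-style call would have been:
##   .Fortran("big_sum", n = as.integer(n), x = x, res = double(1))
res <- .C64("big_sum",
            SIGNATURE = c("int64", "double", "double"),  # one type per argument
            n   = length(x),                             # passed as a 64-bit integer
            x   = x,                                     # possibly a long vector
            res = double(1),
            INTENT  = c("r", "r", "w"),   # read-only arguments need not be copied back
            PACKAGE = "mypkg")            # hypothetical shared library with the routine
res$res
```

The INTENT declaration is where the copying question comes in: arguments declared "r" are treated as read-only, so they do not have to be copied back into R the way .C/.Fortran would copy everything.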

I think the difficulty of changing the native code of glmnet to support 64-bit indexes depends on the actual code (which I have only looked at, never worked with). It is easy to switch all integers (explicitly or implicitly 32-bit) in the Fortran code to 64-bit. The trouble comes when some integers have to stay 32-bit, which happens e.g. for integer vectors passed from/to R code, because R uses 32-bit integers (even inside long vectors). There are such integer vectors passed in glmnet. How hard the modification is then depends on how clean the original Fortran code is (e.g. whether it uses separate integer variables for indexing and for holding values of integer arrays, etc.).
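
The R-side half of that constraint is easy to see (just an illustration of why those particular integers must stay 32-bit):

```r
typeof(1:3)                # "integer": elements of an R integer vector are 32-bit
.Machine$integer.max       # 2147483647, the largest value an R integer can hold
.Machine$integer.max + 1L  # NA, with an integer-overflow warning
```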

Experimental implementations of subsets of R, like Riposte, will not help.

Tomas Kalibera
  • Thanks for the info, but some digging around seems to indicate that to switch from .C to .Call requires major changes to the underlying Fortran code. That's exactly what I'm trying to avoid. Sounds like there simply may not be a solution that fits my needs. – Danny Oct 19 '16 at 15:50
  • I've updated my response. I think the difficulty depends on the actual code, so people who have worked with that code could give the best answer. My guess: after a day or two of programming you would either have it done or have a good estimate. Certainly this should not be a complete rewrite. – Tomas Kalibera Oct 20 '16 at 06:06
  • That seems to have done it! The key for me was the dotCall64 package. Using .Call directly was a bit beyond what I have time for now, but with dotCall64 I simply had to replace the .Fortran calls and add a list of data types for the input variables. Identifying the correct data types took a bit of time but wasn't too difficult. There are still some problems with memory, but I think I'll be able to work around them. Thanks so much Tomas! – Danny Oct 24 '16 at 14:03

There is a note in ?"long vector" which states:

However, compiled code typically needs quite extensive changes. Note that the .C and .Fortran interfaces do not accept long vectors, so .Call (or similar) has to be used.

elnet makes .Fortran calls. You would have to modify the function to use .Call instead, perhaps via a C wrapper that calls the Fortran code, and possibly rewrite and recompile the relevant Fortran code to deal with long vectors.
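
If you want a rough idea of how much of the interface is affected, one way (a sketch; `elnet` is internal to glmnet, so the details may differ between versions) is to list the `.Fortran` calls in the fitting routine:

```r
library(glmnet)
## List the lines of the internal fitting routine that call .Fortran;
## each of these is a call site that would need to move to .Call (or .C64).
grep(".Fortran", deparse(body(glmnet:::elnet)), value = TRUE, fixed = TRUE)
```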

James
  • Thanks for the info, but some digging around seems to indicate that to switch from .C to .Call requires major changes to the underlying Fortran code. That's exactly what I'm trying to avoid. Sounds like there simply may not be a solution that fits my needs. – Danny Oct 19 '16 at 15:50
  • No, if the underlying code is wedded to 32-bit vectors, I'm afraid you are stuck with it. – James Oct 19 '16 at 20:37