
I am attempting to run a classification algorithm for a dataset with no missing values. Here is the dataset description:

'data.frame':   59977 obs. of  6 variables:
 $ gender      : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 1 1 2 2 ...
 $ age         : num  35.7 35.7 35.7 35.7 35.7 ...
 $ code        : Factor w/ 492 levels "ADN105","AXN16B",..: 128 128 128 363 363 363 104 104 221 221 ...
 $ totalflags  : num  4 4 4 4 4 4 3 3 2 2 ...
 $ measure2    : num  30 30 30 1 1 1 23 23 22 22 ...
 $ outcome     : num  1 1 1 0 0 0 1 1 1 1 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:138] 3718 3719 5493 5494 5495 5496 7302 7303 8415 8416 ...
  .. ..- attr(*, "names")= chr [1:138] "4929" "4930" "7384" "7385" ...

When I run the following command

x <- Mydataset[,1:5]
y <- Mydataset[,6]
fit <- glmnet(x, y, family="binomial", alpha=0.5, lambda=0.001)

I get

Error in lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  : 
  NA/NaN/Inf in foreign function call (arg 5)
In addition: Warning message:
In lognet(x, is.sparse, ix, jx, y, weights, offset, alpha, nobs,  :
  NAs introduced by coercion

Before running the glmnet model, I did this:

Mydataset <- na.omit(Mydataset)

And checked to make sure no NA's exist:

sapply(Mydataset, function(y) sum(length(which(is.na(y)))))

and I got:

    gender        age       code totalflags   measure2    outcome
         0          0          0          0          0          0

I looked at other questions but couldn't find anything relevant. I'd appreciate any thoughts or help with this.

EDIT: ANSWER

I did a little digging and decided to convert the data frame to a numeric matrix, and the model ran without complaining. This is the code that helped me:

x <- data.matrix(Mydataset[,1:5])
y <- data.matrix(Mydataset[,6])
John Doe
  • Possible duplicate of https://stackoverflow.com/questions/21858124/r-error-in-glmnet-na-nan-inf-in-foreign-function-call – akrun Dec 28 '17 at 19:37
  • I see someone commenting that they get the same error without any NA values but no solution is available for their comment. My dataset has no NA values at all. – John Doe Dec 28 '17 at 20:02
  • So you checked for NA, but the error message said one of `NA/NaN/Inf`. A common reason for Inf is division by 0, and a common reason for NaN is log(0). You should also re-factor the `na.omit`-ted dataframe to remove any non-existent levels. – IRTFM Dec 28 '17 at 20:28

1 Answer


The most likely cause is small or zero counts in one or more levels of the factor variables (na.omit can leave behind levels that no longer have any rows). Try this first:

Mydataset[c('gender', 'code')] <- lapply(Mydataset[c('gender', 'code')], factor)

If that's not effective, then you should show the actual code used, along with a better description and the names of all objects involved. At the moment we don't even know what x and y are.
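To illustrate why re-applying factor() can matter, here is a minimal sketch on made-up data (the data frame `df` and its columns are invented for illustration and are not the asker's dataset). It shows that na.omit() removes rows but keeps the original level set, so a level can end up with zero observations until the factor is rebuilt:

```r
# Toy data, invented for illustration: level "B" only occurs in a row
# that also contains an NA, so na.omit() will remove every "B" row.
df <- data.frame(
  code  = factor(c("A", "B", "C", NA, "A")),
  value = c(1.5, NA, 2.0, 3.0, 4.0)
)

clean <- na.omit(df)

# The rows are gone, but the level set is unchanged.
levels_before <- levels(clean$code)
counts_before <- table(clean$code)

# Rebuilding the factor drops levels with no remaining rows
# (droplevels(clean) would do the same for every factor column).
clean$code <- factor(clean$code)
levels_after <- levels(clean$code)

print(levels_before)
print(counts_before)
print(levels_after)
```

Comparing `levels_before` with `levels_after` shows whether any empty levels were hiding in the data after the NA rows were dropped.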

EDIT: The glmnet function does not have a formula interface and is not set up to handle data.frames and factors the way typical R regression functions do. Your x is still a list/data.frame, whereas the help page ?glmnet expects a numeric matrix. After a bit of searching for the correct way to handle factors in that situation, I suggest converting your factors to dummy variables with model.matrix. The results will be easier to interpret if you replace the default treatment contrasts with a scheme that keeps one indicator column per level (see https://stats.stackexchange.com/questions/69804/group-categorical-variables-in-glmnet):

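# Contrast function that keeps every level as its own 0/1 column
# (contr.treatment with contrasts=FALSE returns an identity matrix).
contr.Dummy <- function(contrasts, ...){
   conT <- contr.treatment(contrasts=FALSE, ...)
   conT
}
# Note: this changes the contrasts option for the whole R session.
options(contrasts=c(ordered='contr.Dummy', unordered='contr.Dummy'))

# Expand the factors in x into dummy columns; -1 drops the intercept column.
x.m <- model.matrix( ~.-1, x)
fit <- glmnet(x=x.m, y, family="binomial", alpha=0.5, lambda=0.001)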
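As a rough illustration of the difference between the conversions discussed here, the sketch below uses a small invented data frame (`toy` and its values are assumptions, not the asker's data) and base R only, with the default contrasts. It does not trace what glmnet does internally. It only shows one way factor labels can become NA under numeric coercion, and how data.matrix() and model.matrix() differ in what they produce:

```r
# Toy data, invented for illustration.
toy <- data.frame(
  gender = factor(c("F", "M", "M", "F")),
  code   = factor(c("ADN105", "AXN16B", "ADN105", "AXN16B")),
  age    = c(35.7, 41.2, 29.0, 52.3)
)

# as.matrix() on mixed factor/numeric columns gives a character matrix;
# coercing that to numeric turns the factor labels into NA.
chr <- as.matrix(toy)
num <- suppressWarnings(as.numeric(chr))
has_na <- anyNA(num)

# data.matrix() replaces each factor with its integer level codes, which
# treats the levels as if they were ordered, evenly spaced numbers.
codes <- data.matrix(toy)

# model.matrix() expands factors into 0/1 indicator columns instead.
dummies <- model.matrix(~ . - 1, data = toy)

print(has_na)
print(codes)
print(colnames(dummies))
```

With a 492-level factor such as `code`, the integer codes from data.matrix() impose an arbitrary numeric ordering on the levels, so the model.matrix() route is usually the safer input for glmnet when the levels are unordered categories.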
IRTFM
  • Hi @42-, apologies, made a typo in my answer -- I used data.matrix(Mydataset) and not data.frame(Mydataset). – John Doe Dec 29 '17 at 20:25