
I have data in wide format, which I convert with mlogit.data. I then tried to fit a mixed logit model using the mlogit package. I have one-hot encoded the categorical columns (color, size_group). Is that what causes the error below?

The numerical features in model_data are log1p-transformed.

Complete.choice <- mlogit.data(model_data, choice = "y", 
                                 varying = 2:79, shape = "wide", sep = "__", id = "customer_id")
formula <- as.formula("y ~ price + weight + length + height + width + color_white + 
                    color_red + color_black + size_group_1 + size_group_3 + size_group_5 + 
                     size_group_4 + size_group_2 | -1")

# rpar: give every feature a normally distributed ("n") random coefficient
features <- c("price","weight","length","height","width","color_white",
              "color_red","color_black" ,"size_group_1",
              "size_group_3","size_group_5","size_group_4","size_group_2" )
random_parameter <- rep("n", length(features))
names(random_parameter) <- features

sample.mxl <- mlogit(formula, Complete.choice , rpar = random_parameter, 
                       R = 40, halton = NA, panel = TRUE, seed = 123, print.level = 0)

Error in solve.default(H, g[!fixed]) : 
  system is computationally singular: reciprocal condition number = 3.23485e-18
Yashwanth
    Please share a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), including some sample data. – Martin Gal May 29 '20 at 08:59

1 Answer


The error means that the Hessian matrix is (numerically) singular: its determinant is effectively zero, so it cannot be inverted. In practice, this means you cannot obtain the variance-covariance matrix.
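To see what a singular matrix does to `solve()`, here is a minimal base R sketch. The data are made up: two identical columns stand in for two perfectly collinear regressors, and `crossprod(X)` only plays the role of the Hessian. The exact wording of the message may differ from yours ("exactly" versus "computationally" singular), but the cause is the same.

```r
# Two identical columns stand in for two perfectly collinear regressors
# (illustrative data, not taken from the question).
X <- cbind(a = c(1, 2, 3, 4), b = c(1, 2, 3, 4))
H <- crossprod(X)  # X'X, a 2 x 2 matrix playing the role of the Hessian

det_H <- det(H)    # determinant of a singular matrix

# solve() cannot invert H, so we capture the error message instead of stopping.
result <- tryCatch(
  solve(H, c(1, 1)),
  error = function(e) conditionMessage(e)
)

print(det_H)
print(result)
```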

There are several reasons why this might happen:

  1. You don't have enough variation in your data to identify the model. You are trying to estimate a very complex model, which demands a lot from your data (both variation and number of observations).
  2. The model is over-specified (have you made the correct normalizations?).
  3. You are estimating 13 random parameters, which asks a lot from your data. I would start with a single random parameter and gradually add more to see when your model fails. With more than 4-5 random parameters, you also shouldn't use standard Halton draws; you need some type of scrambling procedure. I would recommend scrambled Sobol draws, MLHS draws or scrambled Halton draws.
  4. You are only using R=40. This is a very low number and will give a poor approximation to the multidimensional integral that is the mixed logit probability. The number of draws needed increases with the complexity of the model, the number of available alternatives, etc. Many people think 500-1000 is good, whereas others tend to use 5000 or more. Personally, I start at 1000 and gradually increase until my parameters stabilize. Too few draws could also cause the error you are seeing.
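To illustrate the last point, here is a rough base R sketch of how simulation noise shrinks with the number of draws. It is a toy setup, not your model: two alternatives, one normally distributed coefficient, made-up values for `mu`, `sigma` and `x`, and plain pseudo-random draws instead of Halton draws. `sim_prob` is a hypothetical helper written for this example.

```r
# Simulated choice probability of alternative 1 in a two-alternative logit
# with a single normally distributed coefficient (all numbers are made up).
sim_prob <- function(R, mu = -1, sigma = 1, x = c(1, 2)) {
  beta <- rnorm(R, mean = mu, sd = sigma)    # R pseudo-random draws
  mean(1 / (1 + exp(beta * (x[2] - x[1]))))  # average logit probability
}

set.seed(123)
# Repeat the simulation 200 times per setting and measure how much the
# approximation moves around from one set of draws to the next.
spread <- sapply(c(40, 1000), function(R) sd(replicate(200, sim_prob(R))))
names(spread) <- c("R = 40", "R = 1000")
print(spread)
```

The spread across repetitions is the noise the optimizer has to work with; with only 40 draws it is considerably larger than with 1000.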

It is impossible to diagnose the reason without testing on the actual data, but these are at least some pointers to get you started.
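One way to act on the "start small" advice is sketched below. `first_failure` and `fit_fun` are hypothetical names introduced for this example: `fit_fun` is any function you supply that takes a named `rpar` vector, fits the model, and throws an error if estimation fails. The commented call at the bottom assumes the objects from the question and is not run here.

```r
# Add random parameters one at a time and report where estimation first fails.
# `fit_fun` is a user-supplied function: it receives a named `rpar` vector
# and either returns a fitted model or throws an error.
first_failure <- function(features, fit_fun) {
  for (k in seq_along(features)) {
    # Normal ("n") random coefficients for the first k features only.
    rpar <- setNames(rep("n", k), features[seq_len(k)])
    ok <- tryCatch({
      fit_fun(rpar)
      TRUE
    }, error = function(e) {
      message("Failed with ", k, " random parameter(s): ", conditionMessage(e))
      FALSE
    })
    if (!ok) return(k)
  }
  NA_integer_  # every specification could be estimated
}

# With the objects from the question this might be called as (not run here):
# first_failure(features, function(rpar)
#   mlogit(formula, Complete.choice, rpar = rpar, R = 1000,
#          halton = NA, panel = TRUE))
```

The order of `features` decides which parameters become random first, so it is worth putting the ones you care most about at the front.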

edsandorf
  • Thanks for the explanation. How do I add categorical features to an mlogit model? Do I need to one-hot encode them? – Yashwanth May 29 '20 at 12:21
  • I am not too familiar with the `mlogit` package, but you can probably include them as factor variables in the model formula. The safest, I'd say, is to dummy-code them because it gives you full control over which category is omitted for identification. – edsandorf May 29 '20 at 12:57
  • For example, you can only estimate J-1 alternative-specific constants, or if you have a categorical variable that you have turned into dummies, you need to omit one category/dummy, or said another way, normalize/fix one parameter to zero. If you don't, you will have perfect multicollinearity. – edsandorf May 31 '20 at 04:27
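The perfect multicollinearity described in the last comment can be checked with a small base R sketch. The `size_group` values below are made up, and the explicit constant column is only there to make the problem visible. In a logit model, where only differences in utility matter, a full set of dummies that sums to one for every alternative should behave in much the same way. Likewise, if the dummies for both color and size_group are complete sets, they are collinear with each other.

```r
# Made-up categorical variable with five levels, as with size_group.
size_group <- factor(c(1, 2, 3, 4, 5, 1, 3, 5, 2, 4))

# One-hot encoding: one column per level, no intercept.
all_dummies <- model.matrix(~ size_group - 1)

# The dummy columns sum to 1 in every row, so adding a constant makes the
# columns linearly dependent: the rank is lower than the number of columns.
with_constant <- cbind(const = 1, all_dummies)
print(c(columns = ncol(with_constant), rank = qr(with_constant)$rank))

# Dropping one dummy (the reference category) restores full column rank.
drop_one <- cbind(const = 1, all_dummies[, -1])
print(c(columns = ncol(drop_one), rank = qr(drop_one)$rank))
```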