
My problem is a persistent "computationally singular" error when trying to use the mlogit package.

First, a little bit about my data:

My data concerns predicting choice in the context of a sports draft. Each team makes an ordered selection from the same pool of players, with team and player attributes. Thus, in the language of mlogit, each "team" is an individual, and each "player" an alternative. To provide an oversimplified example, say five teams each chose a player.

Pick      Player  PPG  Age Team
1    Ben Simmons 19.2  19  PHI
2 Brandon Ingram 17.3  18  PHI
3   Jaylen Brown 14.6  19  PHI
5      Kris Dunn 16.4  21  PHI
6    Buddy Hield 25.0  22  PHI

I'm attempting to use the mlogit package. I first use mlogit.data to reformat my data.

Choices <- mlogit.data(test,
     choice="picked",
     shape="long",
     id.var="Team",
     alt.var="Player",
     chid.var="Team",
     varying=c(4:5))

The result looks like this:

                    picked Pick       Player  PPG  Age Team
PHI.Ben Simmons      TRUE    1    Ben Simmons 19.2  19  PHI
PHI.Brandon Ingram  FALSE    2 Brandon Ingram 17.3  18  PHI
PHI.Jaylen Brown    FALSE    3   Jaylen Brown 14.6  19  PHI
PHI.Kris Dunn       FALSE    5      Kris Dunn 16.4  21  PHI
PHI.Buddy Hield     FALSE    6    Buddy Hield 25.0  22  PHI
LAL.Brandon Ingram   TRUE    2 Brandon Ingram 17.3  18  LAL
LAL.Jaylen Brown    FALSE    3   Jaylen Brown 14.6  19  LAL
LAL.Kris Dunn       FALSE    5      Kris Dunn 16.4  21  LAL
LAL.Buddy Hield     FALSE    6    Buddy Hield 25.0  22  LAL
BOS.Jaylen Brown     TRUE    3   Jaylen Brown 14.6  19  BOS
BOS.Kris Dunn       FALSE    5      Kris Dunn 16.4  21  BOS
BOS.Buddy Hield     FALSE    6    Buddy Hield 25.0  22  BOS
MIN.Kris Dunn        TRUE    5      Kris Dunn 16.4  21  MIN
MIN.Buddy Hield     FALSE    6    Buddy Hield 25.0  22  MIN
NOP.Buddy Hield      TRUE    6    Buddy Hield 25.0  22  NOP

Obviously, I have a lot more players and variables but that's the basic structure.

I then try to run a conditional logit regression:

mlogit(picked ~ <regvar>, data=Choices)

I repeatedly encounter the following error:

Error in solve.default(H, g[!fixed]) : 
  system is computationally singular: reciprocal condition number = 3.72907e-23

A solution I have seen elsewhere suggests eliminating the non-invertibility by removing highly correlated variables. However, this doesn't solve my issue. The error persists, with a different reciprocal condition number, even in simple two-variable models with weakly correlated regressors, like the example data. In fact, it even occurs with a single regressor!

Simplified version of what I'm trying to do:

model<-mlogit(
  picked~PPG+Age,
  data=Choices)

Perhaps it's just a tolerance issue, but given that these variables are not especially correlated, that would be surprising. It seems that something more subtle than variable correlation is at fault. I have also checked whether separating individual-specific and alternative-specific variables helps. It does not: for example, adding a team-specific variable such as "Team_MSA_Size" does not change anything:

model<-mlogit(
  picked~PPG+Age|Team_MSA_Size,
  data=Choices)

Is there something about my data structure or a failure to use mlogit syntax correctly that is leading to this? How would I go about fixing it?

I did find this similar-seeming topic, but I had trouble following it without the data in question. Is the accepted answer suggesting that each choice must always have exactly the same alternatives? If so, that would be deeply unfortunate for me, since each team sees a different list as players are removed by selection. If that's the problem, is there an easy fix here, or does that put it in code-your-own-estimator territory?

I can happily provide more data or other details if it would be helpful.

EDIT: Someone requested toy data. Here is a CSV with toy data, and below is code that reproduces the error.

setwd("Filepath")
library(mlogit)
toy_data <- read.csv("toy_data.csv",header = TRUE)
Choices_test<- mlogit.data(toy_data,
                       choice="picked",
                       shape="long",
                       id.var="Team",
                       alt.var="Pick",
                       chid.var="Team")
mlogit(picked~as.factor(Position)+as.factor(Black)+Age+PPG+APG+RPG+Team_WS,
     data=Choices_test)


Error in solve.default(H, g[!fixed]) : 
  system is computationally singular: reciprocal condition number = 6.53305e-21
CRS1834
  • this seems more appropriate for Cross-Validated, as it's likely a statistical issue rather than a programming one – MichaelChirico Aug 14 '17 at 04:52
  • In my experience, this is a result of some variable losing variation within a group when sufficiently conditioned -- have you tried adding one variable at a time to see when it breaks? if you can, explore the group-wise variation in this variable in more detail. – MichaelChirico Aug 14 '17 at 04:54
  • @MichaelChirico if I understand your question correctly, yes, I have looked into this. The problem persists even when I only include a single right hand side variable. It doesn't seem to matter what that variable is. The universality of the error is what makes me think it is something about the data format/ my use of mlogit syntax rather than the raw data. – CRS1834 Aug 14 '17 at 04:58
  • Looks like you have a lot of duplicate entries. Wouldn't that predispose you to singular X matrix? – IRTFM Aug 14 '17 at 06:01
  • I think the problem is that your choices are all unique and are therefore selected once and only once, so there's only one 1 for each level of the dependent variable. Multinomial logit is appropriate when you have a set of choices, each of which is chosen multiple times under varying conditions. – ulfelder Aug 14 '17 at 10:24
  • @42- yes and no. There are a lot of near-identical rows, and if I run it with only player-specific variables, there will be identical ones. However, there are also team-specific attributes and player-team match attributes. Including these means no rows are exactly duplicated, at least until a team picks for the second time. The problem persists in a subset of unique teams with those variables, so exact duplicates don't seem to be the answer. – CRS1834 Aug 14 '17 at 20:45
  • @ulfelder that was exactly the sort of thing I was wondering about. Is there a similar model that wouldn't require this? – CRS1834 Aug 14 '17 at 20:49
  • No model can learn from a bunch of unique outcomes, because there are no patterns in that scenario. If you can group choices (e.g., by position) or score their attributes (e.g., speed, offense vs. defense), then you've got something a model can learn from. – ulfelder Aug 14 '17 at 21:15
  • @ulfelder I think one of us is misunderstanding the other. Are you assuming that I'm just feeding in their names and outcomes? That certainly wouldn't work, agreed! But it isn't true; I have a broad set of right hand side variables for each observation. These include player attributes such as points per game, height, weight, etc., player positions, and team-player match characteristics (i.e. each team's needs at the players' position). So there certainly would seem to be patterns, and I have no trouble running a simple linear model on pick number. – CRS1834 Aug 14 '17 at 21:24
  • Basically the outcomes are "unique" only in that each player is only positively selected once but not in the sense of being completely idiosyncratic. To me, the mystery is why the conditional logit setting yields this problem when the seemingly similar setting of lm(Pick~,data=Choices) does not. – CRS1834 Aug 14 '17 at 21:26
  • Think of it this way: as far as your model is concerned, the selection of each player is perfectly predicted by that player's attributes. There is no variation on the predictors for positive outcomes, because there is only one set of predictor values associated with each outcome. Logistic regression (binary or multinomial) generally fails when that's the case. You have to generalize the problem to use multinomial logit here. – ulfelder Aug 15 '17 at 10:05
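The "losing variation within a group" idea raised in the comments can be checked on the example data with base R alone. The following is a minimal sketch, not a confirmed diagnosis: it rebuilds the long-format table printed in the question, counts how many distinct values each regressor takes within one player, and then checks the rank of a design matrix that contains one dummy per player plus PPG and Age. Treating the player dummies as a stand-in for alternative-specific intercepts in the fitted model is an assumption made here for illustration.

```r
# Player attributes copied from the example table in the question.
players <- data.frame(
  Pick   = c(1, 2, 3, 5, 6),
  Player = c("Ben Simmons", "Brandon Ingram", "Jaylen Brown",
             "Kris Dunn", "Buddy Hield"),
  PPG    = c(19.2, 17.3, 14.6, 16.4, 25.0),
  Age    = c(19, 18, 19, 21, 22),
  stringsAsFactors = FALSE
)
teams <- c("PHI", "LAL", "BOS", "MIN", "NOP")

# Team i takes the first player still on the board and sees only the rest.
long <- do.call(rbind, lapply(seq_along(teams), function(i) {
  board <- players[i:nrow(players), ]
  board$Team <- teams[i]
  board$picked <- board$Pick == players$Pick[i]
  board
}))

# Number of distinct values each regressor takes within a single player.
ppg_values_per_player <- tapply(long$PPG, long$Player,
                                function(x) length(unique(x)))
age_values_per_player <- tapply(long$Age, long$Player,
                                function(x) length(unique(x)))

# Design matrix: intercept, one dummy per player (hypothetical stand-in for
# alternative-specific intercepts), and the two player attributes.
X <- model.matrix(~ Player + PPG + Age, data = long)
design_rank <- qr(X)$rank

print(ppg_values_per_player)
print(c(columns = ncol(X), rank = design_rank))
```

If the rank is below the number of columns, PPG and Age carry no information beyond the player dummies in this layout, which would make any estimator that includes both singular regardless of how weakly PPG and Age correlate with each other.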

1 Answer


This is how I would set up the data for conditional logistic regression with the mlogit pkg, adapted to your variables as I understand them:

 Choices <- mlogit.data(data=test, choice="picked",
          chid.var="Team",alt.var="Pick",shape="long")

Your alt.var is "Pick" and it needs to be ordered, uniquely, no ties, so in your case it looks like 1:6.

Unless I have misunderstood your question!

If you create a small toy dataset so that code could be tested it would help.
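The uniqueness requirement on alt.var mentioned above can be checked before calling mlogit.data. Below is a minimal base R sketch under stated assumptions: `check_choice_sets` is a hypothetical helper written for this illustration (it is not part of mlogit), and the data frame is rebuilt from the example table in the question.

```r
# Hypothetical helper: summarise each choice set in a long-format data frame.
check_choice_sets <- function(df, chid, alt, choice) {
  groups <- split(df, df[[chid]])
  data.frame(
    chid     = names(groups),
    n_alts   = vapply(groups, nrow, integer(1)),
    n_chosen = vapply(groups, function(g) as.integer(sum(g[[choice]])),
                      integer(1)),
    dup_alts = vapply(groups, function(g) anyDuplicated(g[[alt]]) > 0,
                      logical(1)),
    row.names = NULL
  )
}

# Example data rebuilt from the table in the question.
picks <- c(1, 2, 3, 5, 6)
teams <- c("PHI", "LAL", "BOS", "MIN", "NOP")
long <- do.call(rbind, lapply(seq_along(teams), function(i) {
  data.frame(
    Team   = teams[i],
    Pick   = picks[i:length(picks)],          # players still on the board
    picked = picks[i:length(picks)] == picks[i],
    stringsAsFactors = FALSE
  )
}))

report <- check_choice_sets(long, chid = "Team", alt = "Pick",
                            choice = "picked")
print(report)
```

Each row of the report should show exactly one chosen alternative and no duplicated alternative labels. Note that a choice set with a single alternative, such as NOP in the example, is chosen with probability one whatever the coefficients are, so it adds no information to a conditional logit likelihood.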

Hi again,

Is your syntax:

mlogit(picked~as.factor(Position)+as.factor(Black)+Age+PPG+APG+RPG+Team_WS,
     data=Choices_test)

from the mlogit pkg? It has been a while since I looked, but it does not look like the syntax I use for conditional logistic regression.

In my experience, the computational singularity error message is caused by highly correlated or identical variables.

If you are certain there is little collinearity then my guess is that your syntax for the model is not set up correctly.

PS: Your code throws:

In file(file, "rt") :
  cannot open file 'toy_data.csv': No such file or directory

I think you need to fill 'toy_data.csv' with some rows of data. If your data is sensitive then just invent some (enough to demonstrate the problem).
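One way to make the example reproducible without a separate file is to embed the rows in the script itself. The sketch below uses the 15 rows printed in the question and reads them with `read.csv(text = ...)`; the column set is limited to what that printout shows, so the extra variables in the error-producing code (Position, Black, APG, RPG, Team_WS) are not included and would have to be invented separately.

```r
# Rows copied from the long-format table printed in the question.
toy_csv <- "picked,Pick,Player,PPG,Age,Team
TRUE,1,Ben Simmons,19.2,19,PHI
FALSE,2,Brandon Ingram,17.3,18,PHI
FALSE,3,Jaylen Brown,14.6,19,PHI
FALSE,5,Kris Dunn,16.4,21,PHI
FALSE,6,Buddy Hield,25.0,22,PHI
TRUE,2,Brandon Ingram,17.3,18,LAL
FALSE,3,Jaylen Brown,14.6,19,LAL
FALSE,5,Kris Dunn,16.4,21,LAL
FALSE,6,Buddy Hield,25.0,22,LAL
TRUE,3,Jaylen Brown,14.6,19,BOS
FALSE,5,Kris Dunn,16.4,21,BOS
FALSE,6,Buddy Hield,25.0,22,BOS
TRUE,5,Kris Dunn,16.4,21,MIN
FALSE,6,Buddy Hield,25.0,22,MIN
TRUE,6,Buddy Hield,25.0,22,NOP"

# read.csv accepts inline text, so no file on disk is needed.
toy_data <- read.csv(text = toy_csv, header = TRUE, stringsAsFactors = FALSE)

str(toy_data)
# toy_data can then be passed to mlogit.data() in place of the CSV import.
```

Anyone can paste this into a fresh R session, which avoids the missing-file warning shown above.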

MDEWITT
cousin_pete
  • My understanding is that it shouldn't matter, based on the documentation - the pick number and player name both amount to the same unique labeling of "alternatives." Since each player represents a pick a team hypothetically could choose, they constitute alternatives. Using picks instead would amount to a simple relabeling, since each pick number is associated with a unique player. Is something about that wrong? – CRS1834 Aug 15 '17 at 04:47
  • Also, if I try switching that, it doesn't change the error message I get, but it does change the exact reciprocal condition number by a small amount. That would seem to suggest I must be wrong for thinking the two to be equivalent. – CRS1834 Aug 15 '17 at 04:50