Setting up an Mlogit in R with many observations for each category

Question

I'm trying to use mlogit in R. I'm fairly new to logit models, and I'm having trouble setting up my problem in the mlogit framework. I'm actually not sure that mlogit is the right approach at all. Here is an analogous problem.

Consider a baseball dataset with an outcome variable that takes on the values "out", "single", "double", "triple", and "homerun". For explanatory variables, we have the name of the batter, the name of the pitcher, and the stadium. There are hundreds of observations for each batter, including many where the batter faces the same pitcher.

I figured this is definitely a multinomial logit because I have multiple categorical outcomes, but I am not sure, because all of the documentation seems to deal with "choices" between alternatives, which this isn't really. I tried to start my logit model with one factor variable for the hitter, another for the pitcher, and another for the stadium. When I tried this in R, I got

Error in `row.names<-.data.frame`(`*tmp*`, value = value) : invalid 'row.names' length

After some googling, I think it may be expecting only one observation for each combination of hitter, pitcher, and park. Maybe not? What am I doing wrong? How should I set this up?

Edit: Example of data here

https://docs.google.com/spreadsheets/d/19fiq_QEMj4nAPcTqIRxeaYNPgqeHxKAEuPrfHMeIJ7o/edit?usp=sharing

Please include a [reproducible example](https://stackoverflow.com/q/5963269/1222578) of your data and code, or it's hard for people to know what's going on. — Marius, Jul 04 '17 at 01:52
I'd like to add data, but how can I do that? Can I use a link to a google sheet? — Sam Asin, Jul 04 '17 at 02:33

score 1 · Answer 1 · answered Jul 04 '17 at 08:42

Here are some suggestions on how you can start analyzing your data.

# Your dataset
dts <- structure(list(outcome = c(1L, 1L, 2L, 3L, 1L, 3L, 2L, 3L, 3L, 
3L, 3L, 1L, 2L, 2L, 2L, 1L, 3L, 2L, 2L, 2L, 1L, 2L, 3L, 2L, 2L, 
2L, 2L, 1L, 1L, 2L, 3L, 2L, 3L, 1L, 2L, 2L, 3L, 2L, 3L, 3L, 3L, 
2L, 1L, 1L, 1L, 2L, 3L, 2L, 1L), hitter = structure(c(3L, 3L, 
3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("james", 
"jill", "john"), class = "factor"), pitcher = structure(c(3L, 
3L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 2L, 2L, 2L, 2L, 2L, 3L, 1L, 1L, 
2L, 2L, 3L, 3L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 2L, 1L, 2L, 3L, 2L, 
3L, 2L, 1L, 1L, 2L, 2L, 1L, 3L, 3L, 1L, 2L, 2L, 1L, 1L, 2L, 2L
), .Label = c("bill", "bob", "brett"), class = "factor"), place = structure(c(3L, 
3L, 3L, 3L, 3L, 3L, 3L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 5L, 
5L, 5L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L
), .Label = c("ca", "co", "dc", "ny", "tn"), class = "factor")), .Names = c("outcome", 
"hitter", "pitcher", "place"), class = "data.frame", row.names = c(NA, 
-49L))

# Estimation of a multinomial logistic regression model
library(mlogit)
dts.wide <- mlogit.data(dts, choice="outcome", shape="wide")
fit.mlogit <- mlogit(outcome ~ 1 | hitter+pitcher+place, data=dts.wide)

# Results
library(stargazer)
stargazer(fit.mlogit, type="text")

# Model coefficients with standard errors and statistical significance (stars)
==========================================
                   Dependent variable:    
               ---------------------------
                         outcome          
------------------------------------------
2:(intercept)            19.456           
                       (3,056.626)        

3:(intercept)            35.179           
                       (4,172.540)        

2:hitterjill             -17.543          
                       (3,056.625)        

3:hitterjill             -33.117          
                       (4,172.540)        

2:hitterjohn             -0.188           
                         (0.996)          

3:hitterjohn             -1.410           
                         (1.056)          

2:pitcherbob             -0.070           
                         (1.005)          

3:pitcherbob             -1.270           
                         (1.091)          

2:pitcherbrett           -0.908           
                         (1.063)          

3:pitcherbrett           -2.284*          
                         (1.257)          

2:placeco                -1.655           
                         (1.557)          

3:placeco                -17.688          
                       (2,840.270)        

2:placedc                -19.428          
                       (3,056.626)        

3:placedc                -34.479          
                       (4,172.540)        

2:placeny                -18.802          
                       (3,056.625)        

3:placeny                -32.873          
                       (4,172.540)        

2:placetn                -18.885          
                       (3,056.626)        

3:placetn                -32.140          
                       (4,172.540)        

------------------------------------------
Observations               49             
R2                        0.155           
Log Likelihood           -44.605          
LR Test             16.388 (df = 18)      
==========================================
Note:          *p<0.1; **p<0.05; ***p<0.01

More details on the estimation of multinomial logistic models in R are available here.
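On the formula itself: an `mlogit` formula has up to three parts separated by `|`. The first part is for variables that differ across alternatives (a price per option, say). The second is for variables that are constant across alternatives for a given observation. Hitter, pitcher, and place describe the plate appearance rather than the outcome category, so they go in the second part, and each one gets a separate coefficient per non-reference outcome, as in the table above. The sketch below uses base R only and a made-up three-row data set to show roughly what the wide-to-long reshaping amounts to: one row per observation and possible outcome, with a flag for the outcome that actually occurred. The names `toy`, `alts`, `long`, and `chosen` are illustrative assumptions, and this is not the package's actual implementation.

```r
# Hypothetical toy data: three plate appearances (not taken from the question)
toy <- data.frame(
  id      = 1:3,
  outcome = factor(c("out", "single", "out"),
                   levels = c("out", "single", "double")),
  hitter  = c("james", "james", "jill"),
  stringsAsFactors = FALSE
)

# The set of possible outcomes ("alternatives" in mlogit's vocabulary)
alts <- data.frame(alt = levels(toy$outcome), stringsAsFactors = FALSE)

# Cross join: one row per (observation, alternative) pair
long <- merge(toy, alts, by = NULL)

# Flag the alternative that actually occurred for each observation
long$chosen <- as.character(long$outcome) == long$alt

# 3 observations x 3 alternatives = 9 rows; hitter is repeated within each id
long <- long[order(long$id, long$alt), c("id", "alt", "chosen", "hitter")]
rownames(long) <- NULL
print(long)
```

Seen this way, the "individual" is simply one plate appearance and the "choice" is whichever outcome occurred, so repeated hitter/pitcher/place combinations are not a problem as long as each plate appearance is its own row.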

Thanks. Could you talk me through why you set up the formula with hitter, pitcher, and place to the right of the "|"? I am having trouble understanding my problem in the "alternative", "individual", "choice" framework that mlogit wants. — Sam Asin, Jul 04 '17 at 18:27
The documentation says: "Data sets used for multinomial logit estimation deals with some individuals, that make one or a sequential choice of one alternative among a set of several alternatives." My dataset obviously doesn't have an individual making a choice at all, is it appropriate to even use these models? And how should I think about fitting it into that framework? — Sam Asin, Jul 04 '17 at 18:40
@SamAsin I realise that this seems rather strange, but I am sure that it is the right way to estimate a multinomial logistic model with the `mlogit` function of the `mlogit` package. As an alternative, you can use the simpler `mlogit` function of the `globaltest` package, where the formula is just `outcome ~ hitter+pitcher+place`. — Marco Sandri, Jul 04 '17 at 19:20
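Another option in the same spirit takes a plain `outcome ~ hitter + pitcher` formula with no "choice" reshaping. The sketch below uses `multinom` from the `nnet` package, which is a different function from the `globaltest` one mentioned in the comment above. The data are randomly generated and purely hypothetical, so the estimates mean nothing; the point is only the shape of the call and of the results.

```r
library(nnet)  # a recommended package, normally installed alongside R

# Hypothetical data, generated at random purely for illustration
set.seed(1)
n <- 200
toy <- data.frame(
  outcome = factor(sample(c("out", "single", "double"), n, replace = TRUE,
                          prob = c(0.6, 0.3, 0.1)),
                   levels = c("out", "single", "double")),
  hitter  = factor(sample(c("james", "jill", "john"), n, replace = TRUE)),
  pitcher = factor(sample(c("bill", "bob", "brett"), n, replace = TRUE))
)

# The first factor level ("out") is the reference outcome
fit <- multinom(outcome ~ hitter + pitcher, data = toy, trace = FALSE)

coefs <- coef(fit)    # one row of coefficients per non-reference outcome
probs <- fitted(fit)  # predicted probability of each outcome, per observation
print(round(coefs, 3))
```

As with the `mlogit` fit, every covariate gets one coefficient per non-reference outcome, so both approaches should describe the same kind of model.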
