How should I format my data for the R mlogit package?

Question

I am using the mlogit package with R.

After importing my data using:

t <-read.csv('junk.csv',header=TRUE, sep=",", dec=".")

and calling:

x <- mlogit.data(t,choice="D",shape="long",id.var="key",alt.var="altkey")

I am getting the following error:

Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.1", "1.2", "1.3",  : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘1.1’, ‘1.2’, ‘1.3’, ‘1.4’, ‘1.5’, ‘1.6’

Any ideas how to fix it?

My data exist in the following format in a csv file:

[junk.csv]

key,altkey,A,B,C,D
201005131,1,2.6,118.17,117,0
201005131,2,1.4,117.11,115,0
201005131,3,1.1,117.38,122,1
201005131,4,24.6,,122,0
201005131,5,48.6,91.90,122,0
201005131,6,59.8,,122,0
201005132,1,20.2,118.23,113,0
201005132,2,2.5,123.67,120,1
201005132,3,7.4,116.30,120,0
201005132,4,2.8,118.86,120,0
201005132,5,6.9,124.72,120,0
201005132,6,2.5,123.81,120,0
201005132,7,8.5,119.23,115,

Andrie · Accepted Answer · 2017-07-29T20:40:26.573

My experience of mlogit is that it isn't very forgiving about data that isn't exactly the way it should be.

In your case, I notice that the first respondent has 6 alternatives, while the second respondent has 7 alternatives. If you format your data to have an equal number of alternatives for each respondent the mlogit.data function works:

dat <- read.table(sep=",",text="
key,altkey,A,B,C,D
201005131,1, 2.6,118.17,117,0
201005131,2,1.4,117.11,115,0
201005131,3,1.1,117.38,122,1
201005131,4,24.6,,122,0
201005131,5,48.6,91.90,122,0
201005131,6,59.8,,122,0
201005132,1,20.2,118.23,113,0
201005132,2,2.5,123.67,120,1
201005132,3,7.4,116.30,120,0
201005132,4,2.8,118.86,120,0
201005132,5,6.9,124.72,120,0
201005132,6,2.5,123.81,120,0
201005132,7,8.5,119.23,115,0
", header=TRUE)

Running mlogit.data on all of the data reproduces the error:

> mlogit.data(dat, choice="D", shape="long", id.var="key", alt.var="altkey")
Error in `row.names<-.data.frame`(`*tmp*`, value = c("1.1", "1.2", "1.3",  : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': '1.1', '1.2', '1.3', '1.4', '1.5', '1.6'
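The imbalance behind this error can be confirmed by counting the rows per key. This is a minimal sketch in base R on a cut-down stand-in for the data (only the two index columns); the names `alts_per_key` and `balanced` are mine, not part of mlogit:

```r
# Cut-down stand-in for the data above: only the columns needed for the check.
dat <- data.frame(
  key    = c(rep(201005131, 6), rep(201005132, 7)),
  altkey = c(1:6, 1:7)
)

# Number of alternatives listed for each respondent (key).
alts_per_key <- table(dat$key)
print(alts_per_key)

# TRUE only when every respondent has the same number of alternatives.
balanced <- length(unique(as.vector(alts_per_key))) == 1
print(balanced)
```

For this data the counts are 6 and 7, so `balanced` is FALSE.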

However, removing row 13 (the 7th alternative of the second respondent) works:

> mlogit.data(dat[-13, ], choice="D", shape="long", id.var="key", alt.var="altkey")
          key altkey    A      B   C     D
1.1 201005131      1  2.6 118.17 117 FALSE
1.2 201005131      2  1.4 117.11 115 FALSE
1.3 201005131      3  1.1 117.38 122  TRUE
1.4 201005131      4 24.6     NA 122 FALSE
1.5 201005131      5 48.6  91.90 122 FALSE
1.6 201005131      6 59.8     NA 122 FALSE
2.1 201005132      1 20.2 118.23 113 FALSE
2.2 201005132      2  2.5 123.67 120  TRUE
2.3 201005132      3  7.4 116.30 120 FALSE
2.4 201005132      4  2.8 118.86 120 FALSE
2.5 201005132      5  6.9 124.72 120 FALSE
2.6 201005132      6  2.5 123.81 120 FALSE

Of course, this isn't very satisfactory, since it destroys some of the data. A better solution is to construct the data in a format that mlogit() expects, and then call mlogit() directly:

dat$key <- factor(as.numeric(as.factor(dat$key)))
dat$altkey <- as.factor(dat$altkey)
dat$D <- as.logical(dat$D)
row.names(dat) <- paste(dat$key, dat$altkey, sep = ".")

Now the data looks like this:

    key altkey    A      B   C     D
1.1   1      1  2.6 118.17 117 FALSE
1.2   1      2  1.4 117.11 115 FALSE
1.3   1      3  1.1 117.38 122  TRUE
1.4   1      4 24.6     NA 122 FALSE
1.5   1      5 48.6  91.90 122 FALSE
1.6   1      6 59.8     NA 122 FALSE
2.1   2      1 20.2 118.23 113 FALSE
2.2   2      2  2.5 123.67 120  TRUE
2.3   2      3  7.4 116.30 120 FALSE
2.4   2      4  2.8 118.86 120 FALSE
2.5   2      5  6.9 124.72 120 FALSE
2.6   2      6  2.5 123.81 120 FALSE
2.7   2      7  8.5 119.23 115 FALSE
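Before fitting, it may be worth checking two things about data built this way: that every key/altkey pair occurs only once (otherwise the row names would clash again), and that each key has exactly one chosen alternative. This is a sketch in base R on a stand-in with only the index and choice columns; `ids`, `unique_rows` and `one_choice_each` are illustrative names:

```r
# Stand-in for the reshaped data: index columns and the choice indicator only.
dat <- data.frame(
  key    = factor(c(rep(1, 6), rep(2, 7))),
  altkey = factor(c(1:6, 1:7)),
  D      = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE,
             FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
)

# Candidate row names of the form "<key>.<altkey>".
ids <- paste(dat$key, dat$altkey, sep = ".")

# anyDuplicated() returns 0 when there are no repeated values.
unique_rows <- anyDuplicated(ids) == 0
if (unique_rows) row.names(dat) <- ids

# Each choice situation should have exactly one chosen alternative.
chosen_per_key  <- tapply(dat$D, dat$key, sum)
one_choice_each <- all(chosen_per_key == 1)

print(unique_rows)
print(one_choice_each)
```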

And you can call mlogit() directly:

mlogit(D ~ A + B + C, dat, 
       chid.var = "key", 
       alt.var = "altkey", 
       choice = "D", 
       shape = "long")

Result:

Call:
mlogit(formula = D ~ A + B + C, data = dat, chid.var = "key", 
    alt.var = "altkey", choice = "D", shape = "long", method = "nr", 
    print.level = 0)

Coefficients:
2:(intercept)  3:(intercept)  4:(intercept)  5:(intercept)  6:(intercept)  
      10.7774         4.8129         5.2257       -17.2522        -7.7364  
7:(intercept)              A              B              C  
      10.0389         1.6010         2.7156         2.9888

Thanks for the hint... Is it possible though to pass a multi-sized set of alternatives? — JohnP, Feb 20 '12 at 09:04
The answer is yes and no. I took another look at `mlogit.data` and the code assumes that the alternatives for each respondent contain the full set. This is partly why I never use `mlogit.data`, but construct the long form data myself. The function `mlogit` that fits the model can deal with the type of data you describe. — Andrie, Feb 20 '12 at 09:18
@Andrie Like the OP, I have data with an unequal amount of options per choice, a column which indicates the choice taken, and a separate column that labels the choices. It isn't very clear how to just apply `mlogit()` to this data, as you suggest. — sautedman, Jul 19 '17 at 22:00
@sautedman I suggest you ask a new question, with a reproducible example — Andrie, Jul 25 '17 at 08:11
@Andrie The OP's request (and my request) are fully within the scope of this question. Your answer just says to eliminate part of the data, which is destructive. You indicate that there is a better solution by passing data directly to `mlogit()`, but do not elaborate further. I think your answer would be much better if you expanded on this point. — sautedman, Jul 29 '17 at 18:38
mlogit.data works with unequal number of alternatives if you changed the argument in `mlogit.data(..., id.var="key"...)` to `mlogit.data(..., chid.var="key"...)` (as you did when you call mlogit with the data.frame you constructed manually). — LmW., Sep 05 '18 at 20:23
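
For reference, the change described in the last comment would look roughly like the sketch below. I have not verified it against every mlogit release, so treat it as an assumption that your installed version still accepts these arguments to `mlogit.data`. The data are the answer's copy, with the last `D` value set to 0, and the call is guarded so the snippet still runs when mlogit is not installed:

```r
# Same data as in the answer (last D value filled in as 0, as the answer did).
dat <- read.csv(text = "key,altkey,A,B,C,D
201005131,1,2.6,118.17,117,0
201005131,2,1.4,117.11,115,0
201005131,3,1.1,117.38,122,1
201005131,4,24.6,,122,0
201005131,5,48.6,91.90,122,0
201005131,6,59.8,,122,0
201005132,1,20.2,118.23,113,0
201005132,2,2.5,123.67,120,1
201005132,3,7.4,116.30,120,0
201005132,4,2.8,118.86,120,0
201005132,5,6.9,124.72,120,0
201005132,6,2.5,123.81,120,0
201005132,7,8.5,119.23,115,0")

if (requireNamespace("mlogit", quietly = TRUE)) {
  # Pass the key as chid.var (choice situation id) rather than id.var,
  # as suggested in the comment above.
  x <- mlogit::mlogit.data(dat, choice = "D", shape = "long",
                           chid.var = "key", alt.var = "altkey")
  print(head(x))
}
```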
