R: how to format my data for multinomial logit?

Question

I am reproducing some Stata code in R and would like to fit a multinomial logistic regression with the mlogit function, from the package of the same name (I know there is a multinom function in nnet, but I don't want to use that one).

My problem is that, to use mlogit, I need my data to be formatted using mlogit.data and I can't figure out how to format it properly. Comparing my data to the data used in the examples in the documentation and in this question, I realize that it is not in the same form.

Indeed, the data I use is like:

df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))

   ID type age dum1 dum2
1   1    2  28    1    1
2   2    3  31    0    0
3   3    4  12    0    1
4   4    2   1    0    1
5   5    1  49    0    0
6   6    1  80    1    0
7   7    4  36    0    1
8   8    1  53    1    0
9   9    3  22    1    1
10 10    2  10    0    0

whereas the data they use is like:

         key altkey    A      B   C D
1  201005131      1  2.6 118.17 117 0
2  201005131      2  1.4 117.11 115 0
3  201005131      3  1.1 117.38 122 1
4  201005131      4 24.6     NA 122 0
5  201005131      5 48.6  91.90 122 0
6  201005131      6 59.8     NA 122 0
7  201005132      1 20.2 118.23 113 0
8  201005132      2  2.5 123.67 120 1
9  201005132      3  7.4 116.30 120 0
10 201005132      4  2.8 118.86 120 0
11 201005132      5  6.9 124.72 120 0
12 201005132      6  2.5 123.81 120 0

As you can see, in their case, there is a column altkey that details every category for each key and there is also a column D showing which alternative is chosen by the person.

However, I only have one column (type), which shows the choice of the individual but does not show the other alternatives or the values of the other variables for each of these alternatives. When I try to apply mlogit, I get:

library(mlogit)
mlogit(type ~ age + dum1 + dum2, df)

Error in data.frame(lapply(index, function(x) x[drop = TRUE]), row.names = rownames(mydata)) : row names supplied are of the wrong length

Therefore, how can I format my data so that it corresponds to the type of data mlogit requires?

Edit: following the advice of @edsandorf, I modified my data frame and mlogit.data works, but now all the other explanatory variables have the same value for each alternative. Should I set these variables to 0 in the rows where the chosen alternative is 0 or FALSE? (In fact, can somebody show me the procedure from where I am to the mlogit results? I don't see where my estimation goes wrong.)

The data I show here (df) is not my true data. However, it is exactly the same form: a column with the choice of the alternative (type), columns with dummies and age, etc.

Here's the procedure I've followed so far (I did not set the alternatives to 0):

# create a dataframe with all alternatives for each ID
qqch <- data.frame(ID = rep(df$ID, each = 4),
                   choice = rep(1:4, 10))

# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")

# recode type: 1 if it matches the alternative in this row, 0 otherwise
for (i in 1:length(df2$ID)){
  df2[i, "type"] <- ifelse(df2[i, "type"] == df2[i, "choice"], 1, 0)
}

# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)

    ID choice  type age dum1 dum2
1.1  1      1 FALSE  28    1    1
1.2  1      2  TRUE  28    1    1
1.3  1      3 FALSE  28    1    1
1.4  1      4 FALSE  28    1    1
2.1  2      1 FALSE  31    0    0
2.2  2      2 FALSE  31    0    0

If I run:

mlogit(type ~ age + dum1 + dum2, df3)

I get the error:

Error in solve.default(H, g[!fixed]) : system is computationally singular: reciprocal condition number

In order to apply a multinomial logit model you need information on the chosen and non-chosen alternatives. It appears you can only observe the chosen alternatives. Can you assume that everyone faced a choice between the same alternatives? Because it might be possible to recreate the non-chosen based on the choices of everyone else. — edsandorf, Dec 05 '19 at 21:44
@edsandorf yes, everyone had a choice, my mistake. In fact, I thought that the mlogit function in R functioned the same way as in Stata — bretauv, Dec 05 '19 at 21:47
so should I add every alternative for each ID and add a column specifying which alternative was chosen? — bretauv, Dec 05 '19 at 21:49
Yes. So you will have one column indicating the individual decision maker, one indicating the choice occasion (if you observe more than one choice per decision maker), one indicating each alternative available and one indicating the choice. Your data will then be in what is called the long format. You still need to run `mlogit.data()` to add the additional attributes to the data that the `mlogit()` function requires. — edsandorf, Dec 05 '19 at 21:52
No, you don't want to set any of the values to zero. How did you define the non-chosen alternatives for each individual? Why would they be exactly the same as the chosen one? It is very hard to say anything else without seeing the data (knowing the nature of the data). Is the example you provided from your real data? — edsandorf, Dec 06 '19 at 07:11
The data I show here (`df`) is not my true data. However, it is exactly the same form: a column with the choice of the alternative (`type`), columns with dummies and age, etc. I have edited my post to show what I've done so far — bretauv, Dec 06 '19 at 08:51
In this particular case, you get the error because the inverse of the Hessian doesn't exist. This is caused by no variation in your data, i.e. your chosen and non-chosen alts are the same. Let me ask a few more clarifying questions having seen how you generate your data. 1) Is the age with regards to the type, i.e. alternative specific or with respect to the decision maker, i.e. individual specific? 2) my previous question about the types. Is for example, type 2 always the same regardless of the decision maker? 3) Can all decision makers choose between all the types? — edsandorf, Dec 06 '19 at 11:01
1) the age is individual specific: it varies across individuals but does not depend on the type chosen. `dum1` and `dum2` work the same way: they vary across individuals but not across choices; 2) a little bit more context: every individual makes a choice over 4 professional programs. Therefore, every type is the same regardless of the individuals / decision makers; 3) every decision maker can choose any of the 4 alternatives. The variable type reflects the choice made by each decision maker — bretauv, Dec 06 '19 at 11:44

score 2 · Accepted Answer · answered Dec 07 '19 at 07:55

Your data doesn't lend itself well to being estimated using an MNL model unless we make more assumptions. In general, since all your variables are individual specific and do not vary across alternatives (types), the model cannot be identified. All of your individual specific characteristics will drop out unless we treat them as alternative specific. By the sounds of it, each professional program carries meaning in and of itself. In that case, we could estimate the MNL model using constants only, where the constant captures everything about the program that makes an individual choose it.
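
To make the "drop out" point concrete, here is a minimal base R sketch of the logit choice probabilities. All names and numbers in it (`choice_probs`, `asc`, `beta_generic`, `beta_by_alt`) are hypothetical values picked for illustration, not estimates from this data. It shows that adding the same `beta * age` term to every alternative's utility leaves the probabilities unchanged, whereas giving age a separate coefficient per alternative does change them.

```r
# Logit probabilities: P(j) = exp(V_j) / sum_k exp(V_k)
choice_probs <- function(v) {
  ev <- exp(v - max(v))  # subtract the max for numerical stability
  ev / sum(ev)
}

asc <- c(0.5, -0.2, 0.1, 0)  # hypothetical alternative-specific constants
age <- 28                    # one individual's age (same for all alternatives)
beta_generic <- 0.03         # hypothetical coefficient shared by all alternatives

p_without_age <- choice_probs(asc)
p_with_age <- choice_probs(asc + beta_generic * age)

# The common term shifts every utility equally, so it cancels out
print(all.equal(p_without_age, p_with_age))

# One hypothetical coefficient per alternative (reference alternative fixed at 0)
beta_by_alt <- c(0.03, -0.01, 0.02, 0)
p_alt_specific <- choice_probs(asc + beta_by_alt * age)
print(round(p_alt_specific, 3))
```

This is why individual specific variables can only enter the model with alternative specific coefficients, with one alternative serving as the reference.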

library(mlogit)
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))

Now, just to be on the safe side, I create dummy variables for each of the programs. type_1 refers to program 1, type_2 to program 2 etc.

qqch <- data.frame(ID = rep(df$ID, each = 4),
                   choice = rep(1:4, 10))

# merge both dataframes
df2 <- dplyr::left_join(qqch, df, by = "ID")

# recode type: 1 if it matches the alternative in this row, 0 otherwise
df2$type <- ifelse(df2$type == df2$choice, 1, 0)

# Add alternative specific variables (here only constants)
df2$type_1 <- ifelse(df2$choice == 1, 1, 0)
df2$type_2 <- ifelse(df2$choice == 2, 1, 0)
df2$type_3 <- ifelse(df2$choice == 3, 1, 0)
df2$type_4 <- ifelse(df2$choice == 4, 1, 0)

# format for mlogit
df3 <- mlogit.data(df2, choice = "type", shape = "long", alt.var = "choice")
head(df3)
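
As a side note, the long-format reshaping can also be sketched in base R alone, without dplyr. This is only an illustrative alternative under the same assumption that everyone chooses among the same four programs; the names `alts`, `long` and `chosen` are mine, and the result would still need to go through `mlogit.data()` before estimation.

```r
df <- data.frame(ID = seq(1, 10),
                 type = c(2, 3, 4, 2, 1, 1, 4, 1, 3, 2),
                 age = c(28, 31, 12, 1, 49, 80, 36, 53, 22, 10),
                 dum1 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0),
                 dum2 = c(1, 0, 1, 1, 0, 0, 1, 0, 1, 0))

# One row per (ID, alternative) pair: 10 individuals x 4 programs
alts <- expand.grid(choice = 1:4, ID = df$ID)

# Attach the individual-specific variables to every alternative row
long <- merge(alts, df, by = "ID")
long <- long[order(long$ID, long$choice), ]

# TRUE on the row of the chosen program, FALSE on the other three
long$chosen <- long$type == long$choice

head(long)
```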

Now we can run the model. I include the dummies for each of the alternatives, keeping alternative 4 as my reference level. Only J-1 constants are identified, where J is the number of alternatives. In the second part of the formula (after the `|`), I use `-1` to remove the alternative specific constants that the model would otherwise have created, and I add your individual specific variables, treating them as alternative specific. Note that this only makes sense if your alternatives (programs) carry meaning and are not generic.

model <- mlogit(type ~ type_1 + type_2 + type_3 | -1 + age + dum1 + dum2,
                reflevel = 4, data = df3)
summary(model)
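
If the coefficients need to be grouped by alternative, one rough way is to split the coefficient names. The sketch below assumes the names follow a `<variable>:<alternative>` pattern such as `age:1`; that pattern is an assumption and may differ between mlogit versions, so check `names(coef(model))` first. The vector `coefs` holds placeholder numbers only, not output from the model above.

```r
# Placeholder values standing in for coef(model); NOT real estimates.
# Names assume "<variable>:<alternative>"; verify with names(coef(model)).
coefs <- c("age:1" = 0.1, "age:2" = 0.2, "age:3" = 0.3,
           "dum1:1" = 0.4, "dum1:2" = 0.5, "dum1:3" = 0.6,
           "dum2:1" = 0.7, "dum2:2" = 0.8, "dum2:3" = 0.9)

# Split each name into its variable and alternative parts
parts <- strsplit(names(coefs), ":", fixed = TRUE)
tidy <- data.frame(
  variable    = vapply(parts, `[`, character(1), 1),
  alternative = vapply(parts, `[`, character(1), 2),
  estimate    = unname(coefs)
)

# One small table per alternative
tidy <- tidy[order(tidy$alternative, tidy$variable), ]
by_alt <- split(tidy[, c("variable", "estimate")], tidy$alternative)
by_alt
```

With a fitted model, replacing the placeholder vector by `coef(model)` should give the same layout, provided the naming assumption holds.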

That is perfect, thanks for the advice and the clear explanations! If you know how to tidy the results so that they are grouped by alternative, can you please add it to your post? — bretauv, Dec 07 '19 at 08:19
