
I am using R to replicate a study and obtain mostly the same results the author reported. At one point, however, I calculate marginal effects that seem unrealistically small. I would greatly appreciate it if you could have a look at my reasoning and the code below and see whether I am mistaken somewhere.

My sample contains 24,535 observations; the dependent variable "x028bin" is binary, taking the values 0 and 1, and there are 10 explanatory variables. Nine of them are numeric; the tenth, "f025grouped", is a factor consisting of different religious denominations.

I would like to run a probit regression including dummies for religious denomination and then compute marginal effects. In order to do so, I first eliminate missing values and use cross-tabs between the dependent and independent variables to verify that there are no small or 0 cells. Then I run the probit model which works fine and I also obtain reasonable results:

probit4AKIE <- glm(x028bin ~ x003 + x003squ + x025secv2 + x025terv2 + x007bin +
                     x04chief + x011rec + a009bin + x045mod + c001bin + f025grouped,
                   family = binomial(link = "probit"),
                   data = wvshm5red2delna, na.action = na.pass)

summary(probit4AKIE)

However, when I calculate marginal effects at the means of all variables, using the probit coefficients and a scale factor, the values I obtain are far too small (e.g. 2.6042e-78). The code looks like this:

ttt <- cbind(wvshm5red2delna$x003,
             wvshm5red2delna$x003squ,
             wvshm5red2delna$x025secv2,
             wvshm5red2delna$x025terv2,
             wvshm5red2delna$x007bin,
             wvshm5red2delna$x04chief,
             wvshm5red2delna$x011rec,
             wvshm5red2delna$a009bin,
             wvshm5red2delna$x045mod,
             wvshm5red2delna$c001bin,
             wvshm5red2delna$f025grouped,
             wvshm5red2delna$f025grouped,
             wvshm5red2delna$f025grouped,
             wvshm5red2delna$f025grouped,
             wvshm5red2delna$f025grouped,
             wvshm5red2delna$f025grouped,
             wvshm5red2delna$f025grouped,
             wvshm5red2delna$f025grouped,
             wvshm5red2delna$f025grouped) # "f025grouped" appears 9 times because this variable consists of 9 levels

ttt <- as.data.frame(ttt)

xbar <- as.matrix(mean(cbind(1,ttt[1:19]))) #1:19 position of variables in dataframe ttt

betaprobit4AKIE <- probit4AKIE$coefficients

zxbar <- t(xbar) %*% betaprobit4AKIE

scalefactor <- dnorm(zxbar)

marginprobit4AKIE <- scalefactor * betaprobit4AKIE[2:20] #2:20 are the positions of variables in the output of the probit model 'probit4AKIE' (variables need to be in the same ordering as in data.frame ttt), the constant in the model occupies the first position

marginprobit4AKIE #in this step I obtain values that are much too small
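For comparison, here is a minimal sketch of the same at-means calculation on simulated data (all names and numbers hypothetical), building the design matrix with `model.matrix()` so that factor levels enter as 0/1 dummies whose means are level proportions:

```r
# Toy data: at-means marginal effects for a probit with one numeric
# covariate and one factor covariate (names are illustrative only)
set.seed(42)
d <- data.frame(y = rbinom(200, 1, 0.5),
                z = rnorm(200),
                f = factor(sample(c("prot", "cath", "none"), 200, replace = TRUE)))

fit  <- glm(y ~ z + f, family = binomial(link = "probit"), data = d)
X    <- model.matrix(fit)   # factor already expanded into 0/1 dummy columns
xbar <- colMeans(X)         # dummy means = level proportions; first entry is the intercept's 1
beta <- coef(fit)

scalefactor <- dnorm(sum(xbar * beta))  # standard normal density at xbar'beta
mfx <- scalefactor * beta[-1]           # drop the intercept
mfx
```

Averaging the columns of `model.matrix()` sidesteps taking the mean of a raw factor, which is not defined and is a plausible source of near-zero results.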

I apologize that I cannot provide you with a working example, as my dataset is much too large. Any comment would be greatly appreciated. Thanks a lot.

Best,

Tobias

    I think you'd be better off posting on Crossvalidated, the stats sister site to SO: http://stats.stackexchange.com/ – Gavin Simpson Apr 27 '11 at 10:48
  • I assume you know that the marginal effect of a probit variable depends of the value of the variable. Since your variables are categorical, maybe it doesn't make sense to use mean values for them. – Manoel Galdino Apr 27 '11 at 15:51
  • @Manoel Galdino I was also wondering whether it is valid to use mean values for categorical variables. However, the author of the study I am replicating apparently does it. Moreover, I adapted my code from [link](http://people.tamu.edu/~b-wood/Essex/RLesson%206.htm), where the author does the same thing. But thanks for your comment; I modified my code to use `wvshm5red2delna$f025grouped=="Buddhism", wvshm5red2delna$f025grouped=="Hinduism",` etc., and now I obtain approximately the same results as the author – Tobias Apr 28 '11 at 09:24
  • @Tobias, I do think that it's better to use modal values rather than mean values for categorical variables. For instance, if the categorical variable is sex, what does the average value mean? If, however, you have a categorical variable that is really a continuous one, but discretized (like categories of income), then maybe it's ok to use the average value of it. – Manoel Galdino Apr 29 '11 at 20:22
  • @Tobias. Reading your comment again, I think I realize what's going wrong. If your categorical variable is a factor in R, then when you compute the mean (xbar), maybe R is converting the factor to numeric, and since you have more than one category, the mean is something like 4 or 5. Take a look at the values of `xbar` in your old code. – Manoel Galdino Apr 29 '11 at 20:30
  • @Manoel Galdino, concerning your second last comment: it makes some sense to me to use modal values for interpretation purpose (e.g. if I want to depart from some 'average person' who might be rather male than female but definitely not 70% male and 30% female), but if I just want to assess the marginal effect of a variable (i.e. see if it is economically significant) I think it still makes sense to use mean values even for dummy variables. – Tobias Apr 30 '11 at 13:19
  • Greene (2008, p. 775) writes: Simply taking the derivative with respect to the binary variable as if it were continuous provides an approximation that is often surprisingly accurate. – Tobias Apr 30 '11 at 13:19
  • @Manoel Galdino, concerning your last comment: Following Greene's argument, I think it would be ok to compute the mean of categorical variables with more than one category in case one is only interested in marginal effects. – Tobias Apr 30 '11 at 13:21
  • Concerning variable f025grouped: I imported this variable as a factor, so I don't think R converts it to numeric. I now think that the initial error I committed was to tell R to take the mean of a factor, which yielded probably 0 and resulted in marginal effects close to 0. However, by modifying my code (cp. my first comment), I now take the mean value of 9 (different) dummy variables and it works more or less fine. – Tobias Apr 30 '11 at 13:22
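That diagnosis is easy to check directly. A minimal sketch of what `mean()` does with a factor versus with a logical dummy:

```r
f <- factor(c("Buddhism", "Hinduism", "Buddhism"))

mean(f)               # warns and returns NA: the mean of a factor is undefined
mean(f == "Buddhism") # 0.667: the mean of a level's dummy is its proportion
```

So any `xbar` entry computed from the raw factor ends up NA (or nonsense after coercion), which propagates through `zxbar` and can drive `dnorm(zxbar)` toward zero.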

2 Answers


This will do the trick for probit or logit:

mfxboot <- function(modform,dist,data,boot=1000,digits=3){
  x <- glm(modform, family=binomial(link=dist),data)
  # get marginal effects
  pdf <- ifelse(dist=="probit",
                mean(dnorm(predict(x, type = "link"))),
                mean(dlogis(predict(x, type = "link"))))
  marginal.effects <- pdf*coef(x)
  # start bootstrap
  bootvals <- matrix(rep(NA,boot*length(coef(x))), nrow=boot)
  set.seed(1111)
  for(i in 1:boot){
    samp1 <- data[sample(1:dim(data)[1],replace=T,dim(data)[1]),]
    x1 <- glm(modform, family=binomial(link=dist),samp1)
    pdf1 <- ifelse(dist=="probit",
                   mean(dnorm(predict(x1, type = "link"))),   # x1: the bootstrap refit, not the original model
                   mean(dlogis(predict(x1, type = "link"))))
    bootvals[i,] <- pdf1*coef(x1)
  }
  res <- cbind(marginal.effects,apply(bootvals,2,sd),marginal.effects/apply(bootvals,2,sd))
  if(names(x$coefficients[1])=="(Intercept)"){
    res1 <- res[2:nrow(res),]
    res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res1)),nrow=dim(res1)[1])
    rownames(res2) <- rownames(res1)
  } else {
    res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
    rownames(res2) <- rownames(res)
  }
  colnames(res2) <- c("marginal.effect","standard.error","z.ratio")
  return(res2)
}

Source: http://www.r-bloggers.com/probitlogit-marginal-effects-in-r/
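The point estimates mfxboot returns are average-derivative marginal effects; stripped of the bootstrap, the core computation is just this (simulated data, hypothetical names):

```r
# Simulated probit data
set.seed(123)
d <- data.frame(x1 = rnorm(500), x2 = rbinom(500, 1, 0.4))
d$y <- rbinom(500, 1, pnorm(-0.2 + 0.5 * d$x1 - 0.8 * d$x2))

fit <- glm(y ~ x1 + x2, family = binomial(link = "probit"), data = d)

pdf <- mean(dnorm(predict(fit, type = "link")))  # average density over the sample
ame <- pdf * coef(fit)[-1]                       # average marginal effects
ame
```

Note this averages the density over all observations rather than evaluating it at the mean observation, so it will generally differ a little from the at-means calculation in the question.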


@Gavin is right and it's better to ask at the sister site.

In any case, here's my trick to interpret probit coefficients.

The probit regression coefficients are approximately the logit coefficients divided by a scale factor of about 1.6. So, if the fitted probit model is Pr(y=1) = Φ(.5 - .3*x), where Φ is the standard normal CDF, this is approximately equivalent to the logistic model Pr(y=1) = invlogit(1.6*(.5 - .3*x)).

I use this to make a graphic with the function invlogit from the arm package. Another possibility is to multiply all coefficients (including the intercept) by 1.6 and then apply the 'divide by 4 rule' (see the book by Gelman and Hill), i.e., divide the rescaled coefficients by 4, and you will get an upper bound on the predictive difference corresponding to a unit difference in x.
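A quick numeric sketch of why the two rules agree: the maximum probit marginal effect for a slope beta is dnorm(0)*beta, and dnorm(0) ≈ 0.4 ≈ 1.6/4 (beta here is just an illustrative value):

```r
beta <- 0.3        # an illustrative probit slope

dnorm(0) * beta    # exact maximum probit marginal effect, about 0.1197
1.6 * beta / 4     # rescale to logit, then divide by 4: 0.12
```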

Here's an example.

x1 = rbinom(100,1,.5)
x2 = rbinom(100,1,.3)
x3 = rbinom(100,1,.9)
ystar = -.5  + x1 + x2 - x3 + rnorm(100)
y = ifelse(ystar>0,1,0)
probit = glm(y~x1 + x2 + x3, family=binomial(link='probit'))
xbar <- colMeans(cbind(1, x1, x2, x3))  # mean of each covariate, with a leading 1 for the intercept

# now the graphic, i.e., the marginal effect of x1, x2 and x3
library(arm)
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*x + probit$coef[3]*xbar[3] + probit$coef[4]*xbar[4]))) #x1
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*xbar[2] + probit$coef[3]*x + probit$coef[4]*xbar[4]))) #x2
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*xbar[2] + probit$coef[3]*xbar[3] + probit$coef[4]*x))) #x3