8

Originally, I mainly want to run a probit/logit model with clustered standard error in R which is quite intuitive in Stata. I came across with the answer here Logistic regression with robust clustered standard errors in R. Therefore, I tried to compare the result from Stata and from R both with the robust standard error and clustered standard error. But I noticed that the outputs for both standard errors across software are not exactly the same. However, if I use the method suggested here https://diffuseprior.wordpress.com/2012/06/15/standard-robust-and-clustered-standard-errors-computed-in-r/. I can get the exact output both from R and Stata for linear regression. Therefore,I am afraid wether the code I wrote in R is not correct and what command to use if I want to run a probit model instead of a logit model.Or if there is any elegant alternatives to solve this? Thanks.

R code

## 1. linear regression
library(rms) 
# model<-lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width,iris)
summary(model)
fit=ols(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, x=T, y=T, data=iris)
fit
robcov(fit) #robust standard error
robcov(fit, cluster=iris$Species) #clustered standard error


## 2. logistic regression
##demo data generation   
set.seed(1234)
subj<-rep(1:20,each=4)
con1<-rep(c(1,0),40)
con2<-rep(c(1,1,0,0),20) 
effect<-rbinom(80,1,0.34)
data<-data.frame(subj,con1,con2,effect)
library(foreign);write.dta(data,'demo_data.dta')

library(rms)
fit=lrm(effect ~ con1 + con2, x=T, y=T, data=data)
fit
robcov(fit)  ##robust standard error
robcov(fit, cluster=data$subj) ## clustered standard error

Stata code

## 1. linear regression
webuse iris
reg seplen sepwid petlen petwid
reg seplen sepwid petlen petwid,r
reg seplen sepwid petlen petwid,cluster(iris)


## 2. logistic regression

use demo_data,clear
logit effect con1 con2
logit effect con1 con2,r
logit effect con1 con2,cluster(subj)
Community
  • 1
  • 1
johnsonzhj
  • 517
  • 1
  • 5
  • 23
  • 2
    Could you specify what `not exactly the same` means? There are a lot of defaults involved that are probably different. It is a priori unclear which defaults are better. But if you want to get exactly the same values, you need to figure out which defaults `Stata` and `robcov` use, and adjust them accordingly. – coffeinjunky May 30 '16 at 15:28
  • Thanks for your comment, I have edited my question to give additional information – johnsonzhj May 30 '16 at 15:37
  • Is it possible that you are using logit without first running logistic? "`logistic` displays estimates as odds ratios; to view coefficients, type logit after running logistic" ([source](http://www.stata.com/manuals13/rlogistic.pdf#rlogistic)) – noumenal May 30 '16 at 15:50
  • Thanks for your suggestion, but it should not affect the P values which are not consistent across the two softwares. – johnsonzhj May 30 '16 at 15:53
  • The stata output is the same as the one achieved through vcovHC. Actually I just checked the method proposed here: [link](http://stackoverflow.com/questions/33927766/logit-binomial-regression-with-clustered-standard-errors), I can get the same output as the one I get from stata, so still be curious how I can achieve this using **rms** package. – johnsonzhj May 30 '16 at 16:08
  • Nothing special, I just want to run the logit/probit regression with clustered standard error and saw that package **rms** are designed to do this. Can you further edit your answer a bit to make it output the result including the P values automatically with the Sandwich package? Thanks. – johnsonzhj May 30 '16 at 16:24
  • Seen that, thanks, I will go through this carefully and will give you feedback later. – johnsonzhj May 30 '16 at 16:35
  • New package that might be of interest to you: https://cran.r-project.org/web/packages/multiwayvcov/multiwayvcov.pdf – coffeinjunky Jun 21 '16 at 14:14

1 Answers1

11

I prefer the sandwich package to compute robust standard errors. One reason is its excellent documentation. See vignette("sandwich") which clearly shows all available defaults and options, and the corresponding article which explains how you can use ?sandwich with custom bread and meat for special cases.

We can use sandwich to figure out the difference between the options you posted. The difference will most likely be the degree of freedom correction. Here a comparison for the simple linear regression:

library(rms)
library(sandwich)

fitlm <-lm(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width,iris)

#Your Blog Post:
X <- model.matrix(fitlm)
n <- dim(X)[1]; k <- dim(X)[2]; dfc <- n/(n-k)    
u <- matrix(resid(fitlm))
meat1 <- t(X) %*% diag(diag(crossprod(t(u)))) %*% X
Blog <- sqrt(dfc*diag(solve(crossprod(X)) %*% meat1 %*% solve(crossprod(X))))

# rms fits:
fitols <- ols(Sepal.Length~Sepal.Width+Petal.Length+Petal.Width, x=T, y=T, data=iris)
Harrell <- sqrt(diag(robcov(fitols, method="huber")$var))
Harrell_2 <- sqrt(diag(robcov(fitols, method="efron")$var))

# Variations available in sandwich:    
variations <- c("const", "HC0", "HC1", "HC2","HC3", "HC4", "HC4m", "HC5")
Zeileis <- t(sapply(variations, function(x) sqrt(diag(vcovHC(fitlm, type = x)))))
rbind(Zeileis, Harrell, Harrell_2, Blog)

          (Intercept) Sepal.Width Petal.Length Petal.Width
const       0.2507771  0.06664739   0.05671929   0.1275479
HC0         0.2228915  0.05965267   0.06134461   0.1421440
HC1         0.2259241  0.06046431   0.06217926   0.1440781
HC2         0.2275785  0.06087143   0.06277905   0.1454783
HC3         0.2324199  0.06212735   0.06426019   0.1489170
HC4         0.2323253  0.06196108   0.06430852   0.1488708
HC4m        0.2339698  0.06253635   0.06482791   0.1502751
HC5         0.2274557  0.06077326   0.06279005   0.1454329
Harrell     0.2228915  0.05965267   0.06134461   0.1421440
Harrell_2   0.2324199  0.06212735   0.06426019   0.1489170
Blog        0.2259241  0.06046431   0.06217926   0.1440781
  1. The result from the blog entry is equivalent to HC1. If the blog entry is similar to your Stata output, Stata uses HC1.
  2. Frank Harrel's function yields results similar to HC0. As far as I understand, this was the first proposed solution and when you look through vignette(sandwich) or the articles mentioned in ?sandwich::vcovHC, other methods have slightly better properties. They differ in their degree of freedom adjustments. Also note that the call to robcov(., method = "efron") is similar to HC3.

In any case, if you want identical output, use HC1 or just adjust the variance-covariance matrix approriately. After all, after looking at vignette(sandwich) for the differences between different versions, you see that you just need to rescale with a constant to get from HC1 to HC0, which should not be too difficult. By the way, note that HC3 or HC4 are typically preferred due to better small sample properties, and their behavior in the presence of influential observations. So, you probably want to change the defaults in Stata.

You can use these variance-covariance matrices by supplying it to appropriate functions, such as lmtest::coeftest or car::linearHypothesis. For instance:

library(lmtest)
coeftest(fitlm, vcov=vcovHC(fitlm, "HC1"))

t test of coefficients:

              Estimate Std. Error t value  Pr(>|t|)    
(Intercept)   1.855997   0.225924  8.2151 1.038e-13 ***
Sepal.Width   0.650837   0.060464 10.7640 < 2.2e-16 ***
Petal.Length  0.709132   0.062179 11.4046 < 2.2e-16 ***
Petal.Width  -0.556483   0.144078 -3.8624 0.0001683 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

For cluster-robust standard errors, you'll have to adjust the meat of the sandwich (see ?sandwich) or look for a function doing that. There are already several sources explaining in excruciating detail how to do it with appropriate codes or functions. There is no reason for me to reinvent the wheel here, so I skip this.

There is also a relatively new and convenient package computing cluster-robust standard errors for linear models and generalized linear models. See here.

Community
  • 1
  • 1
coffeinjunky
  • 11,254
  • 39
  • 57