
I'm currently trying to detect how many lags I should include in my linear regression analysis in R.

The study examines whether the presence of commercial military actors (CMA) correlates with or causes more military and/or civilian deaths. My supervisor is very keen on me using a Lagrange multiplier test to determine how many lags I need. However, he is not an R user and can't help me implement it. He also wants me to include the panel-corrected standard errors (PCSE) proposed by Beck and Katz.

Short variable description:

  • DV = log_military_cas: log transformation of yearly military deaths, by country
  • IV = CMA: dummy variable indicating CMA presence in a country-year combination (1) or no presence (0)
  • Lag variable = lag_md: log_md lagged one year
  • Data = lagr

This is what my supervisor sent me:

Testing for serial correlation. This is what I wrote down in my notes as a grad student: Using the Lagrange multiplier test first recommended by Engle (1984) (but also used by Beck and Katz (1996)), this is done in two steps: 1) estimate the model and save the residuals, and 2) regress these residuals on the first lag of those residuals and the independent variable. If the lag of the residual is statistically significant in the last regression, more lags of the dependent variable are needed. <-- So just do this, but with a model without any lags of the dependent variable. If you find serial correlation, include a lag of the DV and test again.
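
One literal reading of the two steps above can be sketched on simulated data as follows. This is only a sketch under stated assumptions: the data frame `sim`, the AR(1) coefficient of 0.6, and the names `fit0`, `aux`, `res` and `res_lag` are made up for illustration and are not taken from the notes above or from `lagr`. The residuals are lagged within each country so that the last year of one country is not used as the lag for the first year of the next.

```r
# Hypothetical simulated panel: 20 countries x 15 years (NOT the lagr data)
set.seed(1)
n_country <- 20
n_year <- 15
sim <- expand.grid(year = 1:n_year, country = 1:n_country)  # year varies fastest
sim$CMA <- rbinom(nrow(sim), 1, 0.3)

# AR(1) errors within each country, so serial correlation exists by construction
sim$e <- unlist(lapply(1:n_country, function(i) {
  as.numeric(arima.sim(list(ar = 0.6), n = n_year))
}))
sim$y <- 1 + 0.5 * sim$CMA + sim$e

# Step 1: estimate the model without any lag of the DV and save the residuals
fit0 <- lm(y ~ CMA + factor(country) + factor(year), data = sim)
sim$res <- resid(fit0)  # no NAs in this model, so the lengths match

# Lag the residuals by one year within each country (rows are sorted by year)
sim$res_lag <- ave(sim$res, sim$country,
                   FUN = function(x) c(NA, head(x, -1)))

# Step 2: regress the residuals on their first lag and the independent variable
aux <- lm(res ~ res_lag + CMA + factor(country) + factor(year), data = sim)

# Coefficient, standard error, t value and p value for the lagged residual
coef(summary(aux))["res_lag", ]
```

Note that the dependent variable in step 2 is the residual itself, and the regressor of interest is the lagged residual, not the contemporaneous one.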

My question is twofold: 1) what am I doing wrong in the code below, and 2) should the baseline regression include PCSE?

# no lag
lagtest_0a <- lm(log_military_cas ~ CMA + as.factor(country) + as.factor(year), data = lagr)

# save residuals
lagr$Risid_0 <- resid(lagtest_0a)

lagtest_0b <- lm(log_military_cas  ~ CMA + Risid_0 + as.factor(country) + as.factor(year), data = lagr)
summary(lagtest_0b)

# Risid_0 is significant, so I need at least one lag

# lag 1
lagtest_1a <- lm(log_military_cas ~ CMA + lag_md + as.factor(country) + as.factor(year), data = lagr)

# save new residuals
lagr$Risid1 <- resid(lagtest_1a)

# here the following error arrives:
Error in `$<-.data.frame`(`*tmp*`, Risid1, value = c(`2` = 1.84005148256506,  : 
  replacement has 2855 rows, data has 2856

# Then I thought maybe I shouldn't store Risid_0 in the lagr data frame, so I tried storing the residuals in a standalone vector instead.

# save new residuals in a new way
Risid1 <- resid(lagtest_1a)

# rerun model
lagtest1 <- lm(log_military_cas  ~ CMA + Rs_lagtest_md1 + as.factor(country) + as.factor(year), data = lagr)

# Then the following error arrives:
Error in model.frame.default(formula = log_military_cas ~ CMA + Rs_lagtest_md1 +  : 
  variable lengths differ (found for 'Rs_lagtest_md1')

The problem seems to be that when I include lag_md (which has NAs in the first year, since it is lagged), the variables no longer have the same length. However, as far as I know, R omits NAs by default. I even tried specifying this with na.action = na.omit, but the same error arrives.
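
The length mismatch can be reproduced on a toy data frame. The sketch below uses made-up data (`toy`, `fit_omit`, `fit_excl`, `res_omit` and `res_excl` are hypothetical names, not objects from my project) and shows two ways of getting residuals back into the data frame when the model dropped a row because of an NA. Whether either of them resolves the errors above is not confirmed.

```r
# Hypothetical toy data: the lagged regressor is NA in the first row
toy <- data.frame(
  y     = c(1.2, 0.7, 1.9, 2.4, 1.1, 2.0),
  x     = c(0, 1, 0, 1, 1, 0),
  y_lag = c(NA, 1.2, 0.7, 1.9, 2.4, 1.1)
)

# Default behaviour (na.omit): the NA row is dropped before fitting
fit_omit <- lm(y ~ x + y_lag, data = toy)
# na.exclude: the row is also dropped for fitting, but remembered for padding
fit_excl <- lm(y ~ x + y_lag, data = toy, na.action = na.exclude)

length(resid(fit_omit))  # one shorter than nrow(toy)
length(resid(fit_excl))  # same as nrow(toy), with NA for the dropped row

# Option 1: padded residuals can be assigned as a column directly
toy$res_excl <- resid(fit_excl)

# Option 2: with na.omit, match the residuals back by row name
toy$res_omit <- NA_real_
toy[names(resid(fit_omit)), "res_omit"] <- resid(fit_omit)
```

In both cases the residuals end up as a column of the data frame, so a later `lm()` call with `data = toy` sees variables of equal length. A standalone residual vector from the `na.omit` fit would be one element shorter than the columns, which is consistent with a "variable lengths differ" message.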

I hope someone can help me.

  • For panel-corrected standard errors, see the `sandwich` package, in particular `vcovPC`. Your error looks as if you had a single missing value in your regression, hence the model frame is missing one piece of data. You might try `lm(... na.action = na.exclude)`. – dash2 Jun 16 '22 at 11:57
  • Dear dash2, is `vcovPC` the same as the PCSE proposed by Katz? Thank you for your answer. I will try `na.exclude`. – Julian Ekberg Jun 16 '22 at 12:19
  • It doesn't work with `na.exclude` either :( – Julian Ekberg Jun 16 '22 at 12:21
  • If you want advice on a statistical analysis, you should ask for help at [stats.se] rather than Stack Overflow because this isn't really a specific programming question. Or at the very least you should include a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data so we can actually run and test the code. – MrFlick Jun 16 '22 at 12:50
