Simulating a 2-Stage Least Squares regression in R

Question

I´m really new to R and have to do the following task for a class:

This exercise shows the finite sample problem of weak instru- ments in the two-stage least squares. Let the structural equation be yi = beta1 + beta2*x_i + eta_i

while the reduced-form equation be xi = gamma1 + gamma2*z_i + e_i:

Let the sample be i.i.d., with z_i ~ N (0; 1), and (eta_i e_i) ~ N((0 0), (1 0.5 1 0.5))

Set the true parameter values as beta1 = 1; beta2 = 2, gamma1 = 0.

For each n and gamma2 combination below, please use the standard two-stage least square estimator to estimate beta2 (Write down the estimator by yourself.) Repeat this experiment for 1000 times so that we have 1000 estimated beta2's.

n = 100 and gamma2 = 1.
n = 100 and gamma2 = 0:1.
n = 100 and gamma2 = 0:01.
n = 1000 and gamma2 = 0:01.

Plot the kernel density of these beta2's in each case so that we can see the sampling distribution. Use ggplot to put the 4 subgraphs into one graph.

I have spent hours trying to figure out how to translate this into a working simulation.

My last solution has been like this:

#We will simulate data for this exercise

rm(list = ls( ) )

options(max.print=999999)

#4 different groups of parameters, only n and gamma2 change

  #First Estimation: n=100 and gamma2=1


set.seed(1313)
  #We write a function of the regression model, to later replicate that function
  #and receive 1000 simulations of it.
  b2hat_1_f = function(){
  n = 100 
  #b1 is the constant. We therefore assign it a number of values equal to n.
  b1 = matrix(1, nrow = 100)
  b2 = matrix(2, nrow = 2)
  gamma1 = matrix(0, nrow = 100)
  gamma2 = matrix(1, nrow = 2)

  #Generate Data
  #We have to take the square root of the given variance matrix as rnorm in R uses the standard deviation.
  eta_i = rnorm(n, mean = matrix(0, nrow = 2), sd = sqrt(matrix(c(1, 0.5, 0.5, 1), nrow = 2)))
  e_i = rnorm(n, mean = matrix(0, nrow = 2), sd = sqrt(matrix(c(1, 0.5, 0.5, 1), nrow = 2)))
  Z = cbind(1, rnorm(n, mean = 0, sd = 1))
  X = gamma1 + Z %*% gamma2 + e_i
  X_hat = cbind(1,X)
  Y = b1 + X_hat %*% b2 + eta_i

  #Estimate b2_hat
  #Formula for beta2 estimator is beta_hat_IV = (Z'X)^-1*Z'Y
  #This translates to 
  b2hat = solve((t(Z)%*%X_hat), t(Z)%*%Y)
  }

  b2hat_1 = replicate(1000, b2hat_1_f())

  #Plot the kernel density of our estimator
  plot(density(b2hat_1), main="Kernel Density of b2hat_1 Estimator")

  #Save data of estimator
  save(b2hat_1,file="b2hat_1.RData")`

I then have similar set-ups for the three other cases and finally try to compose them into a graph like this:

#Combine 4 subgraphs using ggplot2
  library(ggplot2)
  library(reshape2)

  load("b2hat_comb.RData")

  b2s = b2hat_comb[, c("b2hat_1", "b2hat_2", "b2hat_3", "b2hat_4")]
  print(head(b2s))

  #Put estimators into data.frame format for use of ggplot2
  x1 = data.frame(b2hat_1)
  x2 = data.frame(b2hat_2)
  x3 = data.frame(b2hat_3)
  x4 = data.frame(b2hat_4)

  ggplot() +
    # b2hat_1
    geom_density(data=x1, aes(x=x1),colour="blue", size=1) +

    # b2hat_2
    geom_density(data=x2, aes(x=x2) ,colour="red", size=1) +

    #b2hat_3
    geom_density(data=x3, aes(x=x3) ,colour="red", size=1) +

    #b2hat_4
    geom_density(data=x4, aes(x=x4) ,colour="red", size=1)

Now I know that the setup of the model is propably incorrect and know that this partly due to lack of econometrical knowlegde and misunderstanding of r. I just don´t know how to go on and am currently approaching the "crisis-mode". I´d be very grateful if one of you could find the time to take a look at what I am doing wrong, for example the cbind option for X_hat seems incorrect to me. Additionally I´m uncertain if I am using the correct formular for beta2.

I realize that this page is also not supposed to help me finish my degree, but the ggplot2 part is actually an r related question as I don´t know what I´m doing wrong to produce the combined density curves of the 4 estimators. I´ve tried different code, but none has quite worked. Here I receive the error message:

Don't know how to automatically pick scale for object of type data.frame. Defaulting to continuous.
Error: Aesthetics must be either length 1 or the same as the data (2): x

Thank you so much in advance and apologies for my whiny question. I will be happy to clarify if necessary.

You're passing two rows to ggplot and that's probably not going to work the way you want. I'd suggest transposing the two rows into the long form, combine the four data sets with a third column being the data set's name, then pass ggplot the data frame, the x axis variable, the y axis, and the fill variable being the data set's name. Example code here: http://stackoverflow.com/questions/26075181/multiple-groups-in-geom-density-plot — Ryan Morton, Mar 23 '17 at 16:31
Also, you're returning a 1x2 matrix from your function. Is that you expected? — Ryan Morton, Mar 23 '17 at 16:34
To return this kind of matrix should be correct if I am estimating a model in matrix form, right? At least this how we have done it before. — Christoph Westendorf, Mar 24 '17 at 07:18
It does however also complicate putting the data into ggplot2 as you have observed. I tried to do what you suggested. But after creating a matrix with the values of my estimator (dimensions of [2000 1]) and transposing this, leaving me with 1 row and 2000 columns, I am unsure how to combine the 4 datasets. To me it seems that I don´t fully understand how R stores the data. I would deeply appreciate a further hint. — Christoph Westendorf, Mar 24 '17 at 07:23

score 0 · Accepted Answer · answered Mar 24 '17 at 14:26

Ok, I'm assuming I'm handling the data correctly, but you feel free to correct me. I changed the output from the function to produce a 1x2 matrix, combine the replicated matrices, and use the matrix to create a data frame with two variables. I then added, through assignment, a third variable that represents the name of the data set (x1). I then pass ggplot() the data frame and have it plot the variable x using geom_density with transparency. It's written to accept all four groups (x1:x4) and should produce plots for each:

b2hat_1_f = function(){
  n = 100 
  #b1 is the constant. We therefore assign it a number of values equal to n.
  b1 = matrix(1, nrow = 100)
  b2 = matrix(2, nrow = 2)
  gamma1 = matrix(0, nrow = 100)
  gamma2 = matrix(1, nrow = 2)

  #Generate Data
  #We have to take the square root of the given variance matrix as rnorm in R uses the standard deviation.
  eta_i = rnorm(n, mean = matrix(0, nrow = 2), sd = sqrt(matrix(c(1, 0.5, 0.5, 1), nrow = 2)))
  e_i = rnorm(n, mean = matrix(0, nrow = 2), sd = sqrt(matrix(c(1, 0.5, 0.5, 1), nrow = 2)))
  Z = cbind(1, rnorm(n, mean = 0, sd = 1))
  X = gamma1 + Z %*% gamma2 + e_i
  X_hat = cbind(1,X)
  Y = b1 + X_hat %*% b2 + eta_i

  #Estimate b2_hat
  #Formula for beta2 estimator is beta_hat_IV = (Z'X)^-1*Z'Y
  #This translates to 
  b2hat = solve((t(Z)%*%X_hat), t(Z)%*%Y)
  m <- t(b2hat)
}

b2hat_1 = replicate(1000, b2hat_1_f())

m <- b2hat_1
m <- matrix(data = m , ncol=2)
df <- data.frame(x = m[,1], y = m[,2])
df$group <- "x1"

##with the 'group' variable, you don't need separate geoms
library(ggplot2)
ggplot() +
  geom_density(data = df, aes(x = x,  fill = group),alpha=0.5)

Hope that helps.

It helped a lot! Thank you very much! I also figured out the mistake in my replicated simulation and now should be okay. Thanks again for taking the time to help me. — Christoph Westendorf, Mar 26 '17 at 20:29
You're welcome. Out of curiosity, what was the error in the simulation? — Ryan Morton, Mar 27 '17 at 14:05

Simulating a 2-Stage Least Squares regression in R

1 Answers1