
I need to run a regression on panel data with three dimensions (Year * Company * Country). For example:

============================================
 year | comp | count |  value.x |  value.y
------+------+-------+----------+-----------
 2000 |   A  |  USA  |  1029.0  |   239481
 2000 |   A  |  CAN  |  2341.4  |   129333
 2000 |   B  |  USA  |  2847.7  |   187319
 2000 |   B  |  CAN  |  4820.5  |   392039
 2001 |   A  |  USA  |  7289.9  |   429481
 2001 |   A  |  CAN  |  5067.3  |   589143
 2001 |   B  |  USA  |  7847.8  |   958234
 2001 |   B  |  CAN  |  9820.0  |  1029385
============================================

However, the R package plm does not seem able to cope with more than two dimensions.

I have tried

result <- plm(value.y ~ value.x, data = dataname, index = c("comp","count","year"))

and it returns the error:

Error in pdata.frame(data, index) : 
'index' can be of length 2 at the most (one individual and one time index)

How do you run regressions when the panel data (individual * time) has more than one dimension within "individual"?


In case anyone encounters the same situation, I'll put my solution here:

plm seems unable to cope with this situation directly, so the only thing you can do is add dummies. If the categorical variable you are expanding into dummies contains too many categories, you can try this:

makedummy <- function(colnum, data, interaction = FALSE, interaction_varnum)
{
  char0 <- colnames(data)[colnum]
  char1 <- "dummy"
  tmp <- unique(data[, colnum])
  valname <- paste(char0, char1, tmp, sep = ".")
  valname_int <- paste(char0, char1, "int", tmp, sep = ".")
  ## One dummy per category, dropping the last category as the baseline
  for (i in 1:(length(tmp) - 1))
  {
    if (!interaction)
    {
      ## Plain 0/1 indicator for category i
      tmp_dummy <- ifelse(data[, colnum] == tmp[i], 1, 0)
    }
    if (interaction)
    {
      ## Indicator multiplied by the interaction variable (dummy * x)
      index <- data[, colnum] == tmp[i]
      tmp_dummy <- numeric(nrow(data))
      tmp_dummy[index] <- data[index, interaction_varnum]
    }
    tmp_dummy <- data.frame(tmp_dummy)
    colnames(tmp_dummy) <- if (interaction) valname_int[i] else valname[i]
    data <- cbind(data, tmp_dummy)
  }
  return(data)
}

for example:

## Create fake data
fakedata <- matrix(rnorm(300),nrow = 100)
cate <- LETTERS[sample(seq(1,10),100, replace = TRUE)]
fakedata <- cbind.data.frame(cate,fakedata)

## Try this
fakedata <- makedummy(1,fakedata)

## If you need dummy*x terms to see whether different categories influence the coefficient, try this
fakedata <- makedummy(1,fakedata,interaction = TRUE,interaction_varnum = 2)

Maybe a little verbose; I didn't polish it. Any advice is welcome. Now you can perform OLS on your data.
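As a cross-check, base R's factor() builds the same dummies automatically inside lm(); a minimal sketch with made-up data (the column names here are illustrative, not from the question):

```r
## factor() expands a categorical column into dummies inside lm(),
## which is equivalent to adding the makedummy() columns by hand.
set.seed(1)
d <- data.frame(cate = rep(LETTERS[1:10], 10),  # 10 categories, 100 rows
                x    = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

fit     <- lm(y ~ x + factor(cate), data = d)  # intercept dummies (9 + baseline)
fit_int <- lm(y ~ x * factor(cate), data = d)  # plus 9 dummy * x interaction slopes
```

The `y ~ x * factor(cate)` form corresponds to calling makedummy() with `interaction = TRUE`.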

para19bellum

4 Answers


If you want to control for another dimension in a within model, simply add a dummy for it:

plm(value.y ~ value.x + count, data = dataname, index = c("comp","year"))

Alternatively (especially for high-dimensional data), look at the lfe package, which can 'absorb' the additional dimensions so the summary output is not polluted by the dummy variables.
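For illustration, a sketch of the lfe syntax on made-up data (assumes the lfe package is installed; the variable names mirror the question):

```r
library(lfe)  # assumed installed: install.packages("lfe")

set.seed(42)
d <- data.frame(comp  = rep(c("A", "B"), each = 50),
                count = rep(c("USA", "CAN"), 50),
                year  = rep(2000:2004, 20))
d$value.x <- rnorm(100)
d$value.y <- 3 * d$value.x + rnorm(100)

## Everything after `|` is absorbed as a fixed effect, so summary()
## reports only the value.x coefficient
m <- felm(value.y ~ value.x | comp + count + year, data = d)
```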

Helix123

This question is much like these:

You may not want to create new dummies; with the dplyr package you can use the group_indices function. Although it does not support mutate, the following approach is straightforward:

fakedata$id <- fakedata %>% group_indices(comp, count)

The id variable will be your first panel dimension. So, you need to set the plm index argument to index = c("id", "year").

For alternatives you can take a look at this question: R create ID within a group.
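Putting the pieces together on made-up data shaped like the question's (a sketch; assumes dplyr is installed, and note that group_indices() is superseded by cur_group_id() in recent dplyr versions):

```r
library(dplyr)  # group_indices() still works in recent dplyr, though superseded

d <- data.frame(comp  = rep(c("A", "B"), each = 4),
                count = rep(c("USA", "CAN"), 4),
                year  = rep(2000:2001, 4))
d$id <- d %>% group_indices(comp, count)

## One id per (comp, count) pair; pass index = c("id", "year") to plm()
```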

Rodrigo Remedio
  • Thank you very much! It works. But I am still quite confused: is your solution pooling "comp" and "count", since the "id" variable looks like the Cartesian product of "comp" and "count"? Can the solution fix effects for "comp" and "count" separately rather than collectively? (I hope I've made myself clear…) – para19bellum Nov 25 '17 at 02:29
  • That's a Cartesian product, you are right. But I confess I couldn't understand the second part of your question... Would you like firm 'A' from the 'USA' to be treated as different from firm 'A' from 'Canada'? – Rodrigo Remedio Nov 25 '17 at 10:30
  • The econometric model your answer suggests is 'Y_ict = beta * X_ict + f_ic + g_t + h_i + epsilon_ict', isn't it? (i: firm. c: country. t: year. Y_ict: value.y across i, c, and t. X_ict is similar. f_ic: FE of the id variable you create. g_t: time FE. h_i: FE of firms, as you add factor(firm) in the plm formula.) I am just not quite sure whether the result is exactly the same as 'Y_ict = beta * X_ict + j_c + g_t + h_i + epsilon_ict'. (Hope you understand it this way) – para19bellum Nov 25 '17 at 11:13
  • In fact, the 'id' variable I suggested treats firms from different countries as different. – Rodrigo Remedio Nov 25 '17 at 11:33
  • OK. If the model is exactly what I want, the problem is solved! Thank you for helping me out! – para19bellum Nov 25 '17 at 11:39
  • You can run OLS on 'Y_ict = beta * X_ict + j_c + g_t + h_i + epsilon_ict', where j_c + g_t + h_i are dummy variables. It is the same as the within model on 'Y_ict = beta_ic + beta * X_ict + epsilon_ict'. Gujarati's undergrad textbook has a whole section (The Fixed Effects or Least-Squares Dummy Variable Regression Model) where he discusses the two approaches. – Rodrigo Remedio Nov 25 '17 at 11:44
  • As of `dplyr 0.7.4` you can use `group_indices` in `mutate`. So OP could do `fakedata <- fakedata %>% mutate(id = group_indices(., comp, count))` – MeetMrMet Dec 10 '17 at 17:07

I think you can also do:

df <-transform(df, ID = as.numeric(interaction(comp, count, drop=TRUE))) 

And then estimate

result <- plm(value.y ~ value.x, data = df, index = c("ID","year"))
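A quick base-R check of the interaction() trick, with made-up rows shaped like the question's table:

```r
df <- data.frame(comp  = rep(c("A", "A", "B", "B"), 2),
                 count = rep(c("USA", "CAN"), 4),
                 year  = rep(c(2000, 2001), each = 4))

## drop = TRUE discards unused factor combinations before numbering,
## so ID runs 1..k over the (comp, count) pairs actually present
df <- transform(df, ID = as.numeric(interaction(comp, count, drop = TRUE)))
```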
msh855

I think you want to use lm() instead of plm(). This blog post discusses what you're after:

https://www.r-bloggers.com/r-tutorial-series-multiple-linear-regression/

For your example, I'd imagine it would look something like the following:

lm(formula = comp ~ count + year, data = dataname)
DavimusPrime
  • 368
  • 4
  • 17
  • OLS is not optimal for panel data, since it does not adjust for entity-specific fixed effects nor does it adjust for autocorrelation of the errors. Panel regression (`plm`) is indeed one option that OP can try – acylam Nov 23 '17 at 04:01