3

I am working on a large (but not enormous) database of 1.1 million observations x 41 variables. The data are arranged as an unbalanced panel. Using these variables I specified three different models and ran each of them as a 1) fixed effects, 2) random effects and 3) pooled OLS regression.

The original .RData file containing only the database is about 15 MB. The .RData file containing the database plus the regression results (9 regressions in total) weighs about 650 MB. I do realize that (from the plm documentation)

An object of class c("plm","panelmodel").

A "plm" object has the following elements :

coefficients   the vector of coefficients,
vcov           the covariance matrix of the coefficients,
residuals      the vector of residuals,
df.residual    degrees of freedom of the residuals,
formula        an object of class 'pFormula' describing the model,
model          a data.frame of class 'pdata.frame' containing the variables used for the estimation: the response is in first position and the two indexes in the last positions,
ercomp         an object of class 'ercomp' providing the estimation of the components of the errors (for random effects models only),
call           the call

even so, I am not able to understand why those files should be so massive. To avoid overloading memory while working with the plm objects, I saved them in three separate files (each of which now weighs around 200 MB). I called summary an hour ago to see the fixed-effects model results, but it hasn't shown me anything yet. My question is straightforward: is this normal behavior? Is there something I can do to reduce the size of the plm objects and speed up retrieval of the results?
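For what it's worth, one workaround I could imagine (this is an assumption on my part, not something taken from the plm documentation) is that most of a plm object's bulk sits in its embedded model frame, so stripping that component before saving might shrink the file considerably, at the cost of breaking any later call that needs the stored data (e.g. some predict or update calls). A rough sketch, where slim_plm is an illustrative helper name and mod.fe is the fitted model from above:

```r
# Sketch (assumption): drop the heavy embedded model frame before saving,
# keeping only what is needed for coefficients, vcov and residuals.
slim_plm <- function(mod) {
  mod$model <- NULL   # the embedded pdata.frame is usually the largest part
  mod
}

mod.fe.slim <- slim_plm(mod.fe)
print(object.size(mod.fe.slim), units = "Mb")  # compare with the full object
save(mod.fe.slim, file = "mod_fe_slim.RData")
```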

Here are some things you might want to know:

  • The database I am using is in data.table format
  • The formulas in the regressions are pre-assembled and are passed to the plm calls wrapped in as.formula(), as suggested here. Example:

form <- y ~ x1 + x2 + x3 + ... + xn

mod.fe <- plm(as.formula(form), data = regr, effect = "individual", model = "within", index = c("id", "year"))

Please let me know if there is any other info I can provide that you might need to answer the question.

EDIT

I managed to put together a small-scale database with characteristics similar to the one I am working on. Here it is:

structure(list(id = c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 
5L, 5L, 6L, 6L, 7L, 7L, 7L, 7L, 8L, 8L, 8L, 8L, 9L, 9L, 9L, 9L, 
10L, 10L, 11L, 11L), year = structure(c(1L, 2L, 1L, 2L, 3L, 4L, 
1L, 2L, 1L, 2L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 3L, 4L, 1L, 2L), .Label = c("2000", "2001", "2002", 
"2003"), class = "factor"), study = c(3.37354618925767, 4.18364332422208, 
5.32950777181536, 4.17953161588198, 5.48742905242849, 5.73832470512922, 
6.57578135165349, 5.69461161284364, 6.3787594194582, 4.7853001128225, 
7.98380973690105, 8.9438362106853, 9.07456498336519, 7.01064830413663, 
10.6198257478947, 9.943871260471, 9.84420449329467, 8.52924761610073, 
3.52184994489138, 4.4179415601997, 5.35867955152904, 3.897212272657, 
5.38767161155937, 4.9461949594171, 3.62294044317139, 4.58500543670032, 
7.10002537198388, 6.76317574845754, 6.83547640374641, 6.74663831986349
), ethnic = structure(c(1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 1L, 1L, 
2L, 2L, 3L, 3L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 
1L, 1L, 2L, 2L), .Label = c("hispanic", "black", "chinese"), class = "factor"), 
    sport = c(0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0), health = structure(c(1L, 
    1L, 2L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 1L, 1L, 2L, 2L, 3L, 3L, 
    3L, 3L, 4L, 4L, 4L, 4L, 1L, 1L, 1L, 1L, 2L, 2L, 3L, 3L), .Label = c("none", 
    "drink", "both", "smoke"), class = "factor"), gradec = c(2.72806403942929, 
    3.10067738633308, 4.04728186632456, 2.19701362539883, 1.73115878111307, 
    5.35879931359977, 5.79613840739381, 5.07050219214859, 4.26224490644077, 
    3.53554192927934, 6.10515669475491, 7.18032957183198, 6.73191149590581, 
    6.49512764543435, 6.4783689354808, 6.19974636196512, 5.54014977312232, 
    6.72545652880344, 1.00223129492982, 1.08994269214495, 3.06702680106689, 
    1.70103126320561, 4.82973481729635, 3.14010240687364, 3.8068435242348, 
    5.01254268106181, 5.66497772013949, 4.16303452633342, 4.2751229553617, 
    3.05652055248093), event = c(1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 
    0), evm3 = c(0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0), evm2 = c(0, 
    0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 1, 0, 0, 1, 1, 0, 0, 0, 0), evm1 = c(0, 1, 0, 1, 1, 1, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 
    1, 0, 0, 0, 0), evp1 = c(0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1), 
    evp2 = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
    0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1), evp3 = c(0, 0, 
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 
    0, 0, 0, 0, 0, 0, 0, 1, 0), ndm3 = c(1, 1, 1, 1, 1, 0, 1, 
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 
    1, 1, 1, 1), ndm2 = c(1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1), ndm1 = c(1, 
    0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 
    0, 0, 1, 0, 0, 0, 1, 0, 1, 0), ndp1 = c(0, 1, 0, 0, 0, 1, 
    0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 
    1, 0, 1, 0, 0), ndp2 = c(1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 
    1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0), 
    ndp3 = c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 
    1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1)), .Names = c("id", 
"year", "study", "ethnic", "sport", "health", "gradec", "event", 
"evm3", "evm2", "evm1", "evp1", "evp2", "evp3", "ndm3", "ndm2", 
"ndm1", "ndp1", "ndp2", "ndp3"), class = "data.frame", row.names = c(NA, 
30L))

The formula and the plm call I used are:

form <- gradec ~ year + study + ethnic + sport + health + event + evm3 + evm2 + evm1 + evp1 + evp2 + evp3 + ndm3 + ndm2 + ndm1 + ndp1 + ndp2 + ndp3

plm.f <- plm(as.formula(form), data = data, effect = "individual", model = "within", index = c("id", "year"))

Using object.size() as suggested by @BenBolker, I found that the call generated a plm object weighing 64.5 KB, while the original data frame is only 6.9 KB, which means the results are about 10 times larger than the input data. I then set the options suggested by @zx8754 below, but unfortunately they had no effect. When I finally called summary(plm.f) I got the error message:

Error in crossprod(t(X), beta) : non-conformable arguments

which I eventually got with my large database as well, but only after hours of computing. Here it is suggested that the problem might be due to the coefficient matrix being singular. However, testing for singularity with is.singular.matrix() from the matrixcalc package shows that this is not the case.

Another couple of things you might want to know:

  • year, ethnic and health are factors
  • Variables in the formula are more or less self-explanatory, except for the last ones. event marks a supposedly traumatic event occurring at a certain time: it is coded 1 if an event happened in a given year and 0 otherwise. The variable evm1 equals 1 if one of these events happened in the year before (minus 1) and 0 otherwise. Similarly, evp1 is 1 if the event happens in the following year (plus 1) and 0 otherwise. The ndm. and ndp. variables work the same way, but they are coded 1 when that distance is not observable (because the time span for a certain individual is too short) and 0 otherwise. The presence of such closely connected variables raises the suspicion of perfect collinearity. As mentioned above, however, a test revealed that the matrix is non-singular.
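One further check I could run (a sketch, assuming plm.f is the fitted model from above) would be to test the transformed rather than the raw data, by comparing the rank of the within-transformed model matrix with its number of columns:

```r
# Sketch: test rank deficiency of the *within-transformed* model matrix;
# the fixed-effects transformation can induce collinearity that is
# absent from the raw data.
X <- model.matrix(plm.f)  # transformed model matrix of the fitted model
qr(X)$rank                # numerical rank
ncol(X)                   # if rank < ncol(X), the matrix is rank-deficient
```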

Let me say once again that I would be very thankful if someone could answer the question.

Riccardo
    Because `plm`, `glm`,`lm`, etc., store the data that is used for calculation in resulting object. This might be relevant: http://stackoverflow.com/questions/13387603/how-to-save-glm-result-without-data-or-only-with-coeffients-for-prediction – zx8754 Jan 09 '14 at 16:24
    worth noting also that the `RData` file is in compressed format on disk, so will be bigger once loaded. What is `print(object.size(...),units="Mb")` when applied to your raw data base? – Ben Bolker Jan 09 '14 at 16:44
  • Thank you both. @BenBolker I tried the test you suggested and it turns out that the data base is 309.7Mb, while the `plm` object is 1062.0Mb. Isn't that strange? @zx8754 thank you for showing that. I will try with `x=FALSE` and `y=FALSE`. I think I cannot use `model=FALSE` because `model` is a plm option that lets me choose the type of transformation I want to apply. I'll do a test a let you know how it goes. – Riccardo Jan 09 '14 at 17:10
    This is not really an answer so I post it as a comment. I am having similar issues -- `plm` (with `twoways`) and `lm` (with dummies) want more than 50GB (32GB of RAM + 20GB swap) of memory for a dataset of 200k rows and 4 variables + individual and time effects, 11MB in memory. I ended up just using [gretl](http://gretl.sourceforge.net/), which estimates the model in less than a second and uses less than 100MB of RAM. Stata also easily estimates the model (gretl and Stata results look identical) but you need a not-exactly-inexpensive license for it. – Peter Jan 16 '14 at 09:49

1 Answer


About the error message Error in crossprod(t(X), beta) : non-conformable arguments:

This is likely due to a singularity in the model matrix, just as suggested. Keep in mind that for fixed effects models the model matrix is built from the transformed data (the transformed data frame).

Thus, you will need to check for singularity of the transformed data. The fixed effects transformation can introduce linear dependence (singularity) even if the original data are not linearly dependent! The plm package has good documentation on this issue in ?detect.lindep, which I partly repeat here (one example only):

### Example 1 ###
# prepare the data
data(Cigar)
Cigar[ , "fact1"] <- c(0,1)
Cigar[ , "fact2"] <- c(1,0)
Cigar.p <- pdata.frame(Cigar)

# setup a pFormula and a model frame
pform <- pFormula(price ~ 0 + cpi + fact1 + fact2)
mf <- model.frame(pform, data = Cigar.p)

# no linear dependence in the pooling model's model matrix
# (with intercept in the formula, there would be linear dependence)
detect.lindep(model.matrix(pform, data = mf, model = "pooling"))

# linear dependence present in the FE transformed model matrix
modmat_FE <- model.matrix(pform, data = mf, model = "within")
detect.lindep(modmat_FE)
mod_FE <- plm(pform, data = Cigar.p, model = "within")
detect.lindep(mod_FE) 
alias(mod_FE) # => fact1 == -1*fact2
plm(pform, data = mf, model = "within")$aliased # "fact2" indicated as aliased

So you should run a function that detects linear dependence on the transformed data of your model, which you get via model.matrix(your_model). You can use the functions supplied by plm (detect.lindep, alias) or any function that works on a matrix.

You could also look at your plm model object: your_model$aliased shows whether some variables were dropped in the estimation.
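As a minimal illustration (your_model stands in for any fitted plm object):

```r
# Sketch: TRUE entries in $aliased mark coefficients that plm dropped
# because they are linearly dependent on the others.
your_model$aliased                 # named logical vector
names(which(your_model$aliased))   # just the offending variables
```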

Helix123