R plm thinks my number vector is a factor, why?

Question

With this data input:

A   B   C   D
0.0513748973337 0.442624990365  0.044669941640565   12023787.0495
-0.047511808790502  0.199057057555  0.067542653775225   6674747.75598
0.250333519823608   0.0400359422093 -0.062361320324768  10836244.44
0.033600922318947   0.118359141703  0.048493523722074   7521473.94034
0.00492552770819    0.0851342003243 0.027123088894137   8742685.39098
0.02053037069955    0.0535545969759 0.06352586720282    8442677.4204
0.09050961131549    0.044871795257  0.049363888991624   7223126.70424
0.082789930841618   0.0230375009412 0.090676778601245   8974611.5623
0.06396481119371    0.0467280364963 0.128097065131764   8167179.81463

and this code:

library(plm);
mydata <- read.csv("reproduce_small.csv", sep = "\t");
plm(C ~ log(D), data = mydata, model = "pooling"); # works
plm(A ~ log(B), data = mydata, model = "pooling"); # error

the second plm call returns the following error:

Error in Math.factor(B) : ‘log’ not meaningful for factors

reproduce_small.csv contains the ten lines of data pasted above. B is obviously not a factor; it is clearly a numeric vector, so plm must be treating it as one. The question is "why?" and, more importantly, "how do I fix this?"

Things I've tried:

#1) mydata$B.log <- log(mydata$B) results in

Error in model.frame.default(formula = y ~ X - 1, drop.unused.levels = TRUE) : 
  variable lengths differ (found for 'X')

which is weird in itself, since A and B.log clearly have the same length.

#2) plm(A ~ log(D), data = mydata, model = "pooling"); results in the same error as #1.

#3) plm(C ~ log(B), data = mydata, model = "pooling"); results in the same original error (log not meaningful for factors).

#4) plm(A ~ log(B + 1), data = mydata, model = "pooling"); results in

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels
In addition: Warning message:
In Ops.factor(B, 1) : ‘+’ not meaningful for factors

#5) plm(A ~ as.numeric(as.character(log(B))), data = mydata, model = "pooling"); results in the same original error (log not meaningful for factors).

EDIT: As suggested, I'm including the result of str(mydata):

> str(mydata)
'data.frame':   9 obs. of  4 variables:
 $ A: num  0.05137 -0.04751 0.25033 0.0336 0.00493 ...
 $ B: num  0.4426 0.1991 0.04 0.1184 0.0851 ...
 $ C: num  0.0447 0.0675 -0.0624 0.0485 0.0271 ...
 $ D: num  12023787 6674748 10836244 7521474 8742685 ...

Also trying mydata <- read.csv("reproduce_small.csv", sep = "\t", stringsAsFactors = FALSE); didn't work.

Did you try coercing it into a numeric? maybe read.csv considers it a character vector and makes a factor out of it — Robin Gertenbach, Dec 13 '16 at 10:21
> str(mydata) 'data.frame': 9 obs. of 4 variables: $ A: num 0.05137 -0.04751 0.25033 0.0336 0.00493 ... $ B: num 0.4426 0.1991 0.04 0.1184 0.0851 ... $ C: num 0.0447 0.0675 -0.0624 0.0485 0.0271 ... $ D: num 12023787 6674748 10836244 7521474 8742685 ... — Mikk, Dec 13 '16 at 10:25
use `str(yourdatafile)` to check the structure. and also try to use `stringsAsFactors = FALSE` when reading data from `csv` file. — Zico, Dec 13 '16 at 10:26
OK, yes I can reproduce. The fault is definitely not with reading in the data. Seems very strange. — Axeman, Dec 13 '16 at 10:29
@Zico `mydata <- read.csv("reproduce_small.csv", sep = "\t", stringsAsFactors = FALSE);` doesn't fix the problem. — Mikk, Dec 13 '16 at 10:30
Also note that fitting `C ~ B` gives you coefficients as if `B` is a factor, but `as.numeric(B)` works. `C ~ log(as.numeric(as.character(B)))` works too, but `A ~ log(as.numeric(as.character(B)))` doesn't, and neither does `A ~ B`. — Axeman, Dec 13 '16 at 10:32
You would need to make `mydata` a `pdata.frame` before. And you would need an index for individual and time dimension of your data. Please see the vignette of plm. — Helix123, Dec 13 '16 at 12:59
@Helix123 that was it. You can post that as answer and I'll accept it. Thanks! — Mikk, Dec 13 '16 at 13:12
You are not showing here any panel `index` and you are going for `pooled regression`. So why not use simple `lm`? why go for `plm`? — Zico, Dec 13 '16 at 17:53
`lm(A ~ log(B), data = test_df) Call: lm(formula = A ~ log(B), data = test_df) Coefficients: (Intercept) log(B) -0.04982 -0.04323 ` — Zico, Dec 13 '16 at 17:53
@Zico: one reason could be to be able to use some kind of robust standard errors provided by, e. g., `plm::vcovHC`. — Helix123, Dec 14 '16 at 09:43
I am not sure why plm is being applied to a non-panel dataset just to get robust standard errors. There are a few packages that can do this for you after running lm, such as the `sandwich` [package](https://cran.r-project.org/web/packages/sandwich/sandwich.pdf) in R. Please have a look at this [link](http://thomasleeper.com/Rcourse/Tutorials/olsrobustSEs.html) — Zico, Dec 14 '16 at 10:13
There is also an old post on [StackOverFlow](http://stackoverflow.com/questions/4385436/regression-with-heteroskedasticity-corrected-standard-errors) itself. — Zico, Dec 14 '16 at 10:14
@Zico: This is just a minimal example to reproduce the error. My actual regression had indexes. For some reason (which now appears obviously dependent on the column order of the input files), in some cases the data was loaded correctly and in some others it didn't. So I reproduced the latter with the minimal amount of info possible. — Mikk, Dec 14 '16 at 10:17

Mikk · Accepted Answer · 2016-12-14T10:19:57.147


Helix123 pointed out in the comments that the data.frame should be converted to a pdata.frame. So, for instance, a solution to this toy example would be:

mydata$E <- c("x", "x", "x", "x", "x", "y", "y", "y", "y"); # Create E as an "index"
mydata <- pdata.frame(mydata, index = "E"); # convert to pdata.frame
plm(A ~ log(B), data = mydata, model = "pooling"); # now it works!
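When the data already has real individual and time columns, the index can be named explicitly instead of inventing one. The following is only a minimal sketch under assumptions: the data frame `panel`, its columns `id` and `year`, and all values are hypothetical and not taken from the question. It relies on the `index` argument that both `plm()` and `pdata.frame()` accept.

```r
library(plm)

# Hypothetical panel: 3 individuals observed over 3 years (made-up values).
set.seed(1)
panel <- data.frame(
  A    = rnorm(9),
  B    = runif(9, 0.01, 0.5),
  id   = rep(c("u1", "u2", "u3"), each = 3),
  year = rep(2001:2003, times = 3)
)

# Option 1: name the index columns in the plm() call itself.
# A and B are the first two columns, but they are no longer used as the index.
fit1 <- plm(A ~ log(B), data = panel, index = c("id", "year"), model = "pooling")

# Option 2: build the pdata.frame explicitly, then fit.
pdat <- pdata.frame(panel, index = c("id", "year"))
fit2 <- plm(A ~ log(B), data = pdat, model = "pooling")

# Both routes should give the same coefficients.
all.equal(coef(fit1), coef(fit2))
```

Naming the index this way should make the fit independent of the column order in the input file.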

EDIT: As to "why" this happens: as Helix123 pointed out in the comments, when plm is passed a data.frame instead of a pdata.frame and no index is given, it quietly assumes that the first two columns are the indexes and converts them to factors under the hood. In this example those columns are A and B, which is why any formula involving A or B fails while C ~ log(D) works. plm then throws an unhelpful error instead of issuing a warning that the object passed is not of the expected type, or that it made an assumption at all.
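The conversion can be checked directly by calling `pdata.frame()` without an index, which is what happens when a plain data.frame is passed. This is a small sketch with illustrative values; the data frame `toy` is hypothetical and only mirrors the A, B, C, D column layout of the question.

```r
library(plm)

# Illustrative values only; the column layout mirrors the question (A, B, C, D).
toy <- data.frame(
  A = c(0.05, -0.05, 0.25, 0.03),
  B = c(0.44, 0.20, 0.04, 0.12),
  C = c(0.04, 0.07, -0.06, 0.05),
  D = c(12023787, 6674748, 10836244, 7521474)
)

# No index given: the first two columns (A and B) are taken as the
# individual and time index and converted to factors.
p <- pdata.frame(toy)

is.factor(toy$B)  # column in the plain data.frame
is.factor(p$A)    # first column after conversion
is.factor(p$B)    # second column after conversion
is.factor(p$C)    # a regular, non-index column
```

If the documented default applies, only the two checks on `p$A` and `p$B` should report a factor, which would be consistent with the `Math.factor` error in the question.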

edited Dec 14 '16 at 10:19

answered Dec 13 '16 at 14:13

Mikk


If you check plm in detail, there is an automatic check for the index: plm checks whether data is a pdata.frame and, if it is not, creates one. In this case, though, it mysteriously does not work. – Zico Dec 13 '16 at 17:48

The first two columns will be assumed as index variables (and those will be converted to factors), see ?pdata.frame – Helix123 Dec 14 '16 at 09:42
