
I have a panel data set which looks as follows:

library(plm)
library(Hmisc)
library(data.table)
set.seed(1)
DT <- data.table(panelID = sample(50,50),                                                    # Creates a panel ID
                      Country = c(rep("Albania",30),rep("Belarus",50), rep("Chilipepper",20)),       
                      some_NA = sample(0:5, 6),                                             
                      some_NA_factor = sample(0:5, 6),         
                      Group = c(rep(1,20),rep(2,20),rep(3,20),rep(4,20),rep(5,20)),
                      Time = rep(seq(as.Date("2010-01-03"), length=20, by="1 month") - 1,5),
                      norm = round(runif(100)/10,2),
                      Income = round(rnorm(10,-5,5),2),
                      Happiness = sample(10,10),
                      Sex = round(rnorm(10,0.75,0.3),2),
                      Age = sample(100,100),
                      Educ = round(rnorm(10,0.75,0.3),2))           
DT [, uniqueID := .I]                                                                        # Creates a unique ID     
DT[DT == 0] <- NA                                                                            # https://stackoverflow.com/questions/11036989/replace-all-0-values-to-na
DT$some_NA_factor <- factor(DT$some_NA_factor)
DTp <- plm::pdata.frame(DT, index= c("panelID", "Time"))

I want to evaluate, for each panel observation, whether some_NA_factor (or, for example, Country) changes from one time period to the next: a 1 for a change and a 0 for no change. I would like to write something like:

setDT(DT)[, difference := c(-1,1)*diff(some_NA_factor), by=panelID]

But I don't know how to write this for factors. If I apply it to the data.table, I get the expected warning:

Warning messages:
1: In Ops.factor(c(-1, 1), diff(weight)) : ‘*’ not meaningful for factors

If I apply the same thing to the pdata.frame, I get:

setDT(DTp)[, difference := c(-1,1)*diff(some_NA_factor), by=panelID]
Error in alloc.col(x) : 
  Internal error: length of names (14) is not length of dt (13)

Additionally, when I apply this to my actual data, I get the following error:

Supplied 107438 items to be assigned to group 1 of size 2 in column 'difference'. The RHS length must either be 1 (single values are ok) or match the LHS length exactly. If you wish to 'recycle' the RHS please use rep() explicitly to make this intent clear to readers of your code.

I am not sure why that happens (I cannot seem to reproduce it in the example).

Any ideas?

Tom
  • `diff` isn't meaningful for factors. You could try to replace your `diff(some_NA_factor)` with `diff(as.numeric(some_NA_factor))` – PavoDive Oct 25 '19 at 13:37

1 Answer

Let's get into this step by step.

I want to evaluate, for each panel observation, whether some_NA_factor or for example Country changes from one time period to another (a 1 for a change and a 0 for no change).

You provided the following code:

# actual code:
setDT(DT)[, difference := c(-1,1)*diff(some_NA_factor), by=panelID]

I can see some problems there. First, you don't need setDT(DT): you defined DT as a data.table, so there is no need to convert it to what it already is. Second, if you want a 0 for no change and a 1 for a change, what did you expect to obtain by multiplying by c(-1, 1)? Last, and most important, arithmetic is not meaningful for factors: neither the subtraction that diff performs nor the multiplication works on them. So we need to convert the factor to its numeric codes before taking the difference:

# proposed code:
DT[, difference := 1*(diff(as.numeric(some_NA_factor)) != 0), by=panelID]

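Here we're calculating the difference of a numeric vector, which is itself numeric, and testing whether it is different from zero (TRUE if the value changed, FALSE otherwise). We then convert that logical to numeric by multiplying by 1 (TRUE becomes 1, FALSE becomes 0). Note that diff returns one element fewer than its input. In the example every panelID has exactly two rows, so the single result is recycled to both rows of the group; a group with more than two rows would give a length mismatch in the assignment.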
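If some panels have more than two periods, a length-preserving variant may be easier to work with. The following is only a sketch under assumptions: the toy table and the column names status and changed are made up for illustration, and it uses data.table's shift() to compare each row with the previous one instead of diff(). The first period of each panel has nothing to compare against and is left as NA.

```r
library(data.table)

# Hypothetical toy panel: two units observed over three periods
toy <- data.table(
  panelID = c(1, 1, 1, 2, 2, 2),
  Time    = rep(as.Date(c("2010-01-31", "2010-02-28", "2010-03-31")), 2),
  status  = factor(c("a", "a", "b", "c", "d", "d"))
)

# Make sure rows are in time order within each panel before comparing
setorder(toy, panelID, Time)

toy[, changed := {
  code <- as.integer(status)        # factor -> integer level codes
  as.integer(code != shift(code))   # compare with previous period; first row is NA
}, by = panelID]

print(toy)
```

Because the result has the same length as the group, this should not depend on every panel having exactly two rows. The same pattern could be applied to Country by swapping the column name.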

Diffing the factors in DTp

I don't have {plm} installed, but from reading the documentation it seems that plm::pdata.frame returns an object of class pdata.frame. I'm not sure whether setDT can convert that specific class without problems, so if I were you I would first convert the pdata.frame object to a data.frame (it has its own S3 method for that), and then to a data.table:

library(plm)
DTp <- setDT(as.data.frame(pdata.frame(DT, index= c("panelID", "Time"))))

Calculating the difference of some_NA_factor in DTp is then similar to what was shown above for DT.

PavoDive