
I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically I want R to forget the existing relation and consider the subset dataframe in its own right.

Following the proposed solution in the comments, I find it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried the solution at home on the example dataset and it worked fine, but on my real data the problem is not solved.

Here's the output for my actual data (originally 37 firms):

sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)

[1] 7

s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)

[1] 8

Not only is the problem not solved, but for some reason I get one more firm with a zero sum than before, even though I explicitly ask to keep only the firms whose sum differs from zero.

Unfortunately, I cannot recreate the same problem with a simple example. For that example, as said, droplevels() works just fine.

A simple reproducible example explains:

library(plm)
dad <- cbind(as.data.frame(matrix(1:40, 8, 5)), factors = c("q", "w", "e", "r"), year = c("1991", "1992", "1993", "1994"))
dad<-plm.data(dad,index=c("factors","year"))

kid<-dad[tapply(dad$V5,dad$factors,sum)<=70,]
tapply(kid$V1,kid$factors,mean)

kid<-droplevels(dad[tapply(dad$V5,dad$factors,sum)<=70,])
tapply(kid$V1,kid$factors,mean)    

So I create a dad and a kid dataframe based on some tapply condition (I'm sure this extends more generally).

The result of the tapply on the kid is the following:

e  q  r  w 
7 NA  8 NA

Clearly R has not forgotten the dad: it reports the two dropped factor levels as NA. In itself that is not much of a problem, but in my real dataset, with many more variables and more subsetting to do, I'd like a cleaner cut so that searching through the kid(s) becomes easier. In other words, I don't want the initial factor levels q, w, e, r to be remembered. The desired output would thus be:

e r 
7 8
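For reference, the level-dropping behaviour itself can be reproduced without plm. The following is only a sketch on a plain base R data frame: the names `df`, `g`, `v` and `sub` are made up for illustration, and the subset is selected by group membership rather than by a `tapply` result.

```r
# Hypothetical plain data frame, not the plm object from the question
df <- data.frame(g = factor(c("q", "w", "e", "r", "q", "w", "e", "r")),
                 v = 1:8)

# Subsetting rows keeps all four factor levels on the subset,
# so tapply() reports NA for the groups that are now empty
sub <- df[df$g %in% c("e", "r"), ]
before <- tapply(sub$v, sub$g, mean)

# droplevels() removes the unused levels from every factor column
sub <- droplevels(sub)
after <- tapply(sub$v, sub$g, mean)

print(before)
print(after)
```

On a plain data frame, `after` only has entries for e and r. Whether a plm.dim object behaves identically is exactly what is in question here.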

So, can anyone think of a reason why what works perfectly in a small data.frame would work differently in a larger one? For p.data, N = 592, T = 16 and n = 37. When I run two identical tapply calls, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared, literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong, and it could potentially solve the mystery of the factor levels that refuse to drop as well.
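One possible cause, sketched here on made-up data and not verified against p.data: `tapply` returns one value per group, not one per row, so using its result as a row index recycles a short logical vector down the rows. The names `df`, `g`, `v`, `recycled`, `by_row` and `by_name` are hypothetical, and `ave()` is shown as one base R alternative, not necessarily the right fix for the real data.

```r
# Hypothetical data: 8 rows, 4 groups
df <- data.frame(g = factor(c("q", "w", "e", "r", "q", "w", "e", "r")),
                 v = 33:40)

# tapply() returns one value per level, ordered by level (e, q, r, w)
group_sums <- tapply(df$v, df$g, sum)
length(group_sums)   # 4, while df has 8 rows

# Used as a row index, the 4 logicals are recycled down the 8 rows,
# so the rows kept need not belong to the groups that pass the test
recycled <- df[group_sums <= 70, ]

# ave() returns the group sum repeated on every row (length 8)
by_row <- df[ave(df$v, df$g, FUN = sum) <= 70, ]

# Alternative: select rows by the names of the groups that pass
keep <- names(group_sums)[group_sums <= 70]
by_name <- droplevels(df[df$g %in% keep, ])
```

In this sketch only group q has a sum of 70 or less, yet `recycled` contains the two w rows, while `by_row` and `by_name` contain the two q rows. A mismatch of this kind would change every group sum in the subset, which resembles what I see in s.data.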

Thanks Simon

SJDS
  • Look at `droplevels` on the data frame subset: `kid <- droplevels(dad[tapply(dad$V1,dad$factors,sum)<=9,])` – Blue Magister Apr 24 '14 at 17:43
  • possible duplicate of [dropping factor levels in a subsetted data frame in R](http://stackoverflow.com/questions/1195826/dropping-factor-levels-in-a-subsetted-data-frame-in-r) – Blue Magister Apr 24 '14 at 17:44
  • Strange, this works perfectly for the example but not for my actual data... – SJDS Apr 24 '14 at 17:48
  • Maybe use `droplevels` on just the factor vector: `kid <- dad[tapply(dad$V1,droplevels(dad$factors),sum)<=9,]`. Otherwise without something reproducible, hard to know for sure. What package is `plm.dim` from? – Blue Magister Apr 25 '14 at 14:43
  • plm.dim is a data.frame object from the plm package in which the dataframe has a time and an individual index to facilitate panel regressions. I found that the main problem lies somewhere else: the tapply call that I hoped would eliminate some firms from my dataframe does not work properly, for reasons I do not understand... – SJDS Apr 26 '14 at 16:00
