I have a dataset in which one factor has a lot of levels (about 140), which I think is why the lm function fails:
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
contrasts can be applied only to factors with 2 or more levels
What I would like to do is restrict the lm call to the factor levels for which there are more than x observations.
As an example, this data.table has a factor (some_NA_factor) for which levels 1, 2, 4 and 5 have 17 observations and level 3 has 16. I would like to subset the dataset directly in the lm call so that it only uses the observations whose factor level has more than 16 (at least 17) observations:
set.seed(1)
library(data.table)
DT <- data.table(panelID = sample(50, 50),  # Creates a panel ID
                 Country = c(rep("A", 30), rep("B", 50), rep("C", 20)),
                 some_NA = sample(0:5, 6),
                 some_NA_factor = sample(0:5, 6),
                 Group = c(rep(1, 20), rep(2, 20), rep(3, 20), rep(4, 20), rep(5, 20)),
                 Time = rep(seq(as.Date("2010-01-03"), length = 20, by = "1 month") - 1, 5),
                 norm = round(runif(100) / 10, 2),
                 Income = sample(100, 100),
                 Happiness = sample(10, 10),
                 Sex = round(rnorm(10, 0.75, 0.3), 2),
                 Age = round(rnorm(10, 0.75, 0.3), 2),
                 Educ = round(rnorm(10, 0.75, 0.3), 2))
DT[, uniqueID := .I]  # Creates a unique ID
DT[DT == 0] <- NA # https://stackoverflow.com/questions/11036989/replace-all-0-values-to-na
DT$some_NA_factor <- factor(DT$some_NA_factor)
table(DT$some_NA_factor)
The normal subset syntax in lm could, for example, look as follows:
lm(Happiness ~ Income + some_NA_factor, data=DT, subset=(Income > 50 & Happiness < 5))
How do I adapt this syntax so that it filters on the number of observations per factor level?
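To make the goal concrete, here is a minimal sketch of the kind of filter I have in mind. It uses a small toy data frame rather than DT, and the names df, min_obs and keep_levels are made up for illustration. It tabulates the factor, keeps the names of the levels whose count exceeds the threshold, and passes a membership test to subset. I am not sure this is the idiomatic way to do it.

```r
# Toy data (hypothetical, not the DT above): level "c" has only 2 rows
set.seed(1)
df <- data.frame(
  y = rnorm(22),
  x = rnorm(22),
  f = factor(rep(c("a", "b", "c"), times = c(10, 10, 2)))
)

# Hypothetical threshold: keep levels with MORE than min_obs rows
min_obs <- 5

# Count rows per level and keep the names of the levels above the threshold
keep_levels <- names(which(table(df$f) > min_obs))

# subset is evaluated inside df, so f refers to df$f;
# rows belonging to level "c" are excluded before fitting
fit <- lm(y ~ x + f, data = df, subset = f %in% keep_levels)

nobs(fit)  # number of rows actually used in the fit
```

For the example above, the analogous condition would presumably be subset = some_NA_factor %in% names(which(table(some_NA_factor) > 16)), with the counting done inline instead of in a separate variable.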