Subsetting observations by factor levels with more than x observations

Question

I have a dataset in which one factor has a lot of levels (roughly 140), which I think is why the lm function fails:

Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) : 
  contrasts can be applied only to factors with 2 or more levels

What I would like to do is restrict the lm call to the factor levels that have more than x observations.

As an example, this data.table has a factor (some_NA_factor) for which levels 1, 2, 4 and 5 each have 17 observations and level 3 has 16. I would like to subset the dataset directly (in the lm call) so that it only uses the observations whose factor level has more than 16 (at least 17) observations:

set.seed(1)
library(data.table)
DT <- data.table(panelID = sample(50, 50),   # Creates a panel ID
                 Country = c(rep("A", 30), rep("B", 50), rep("C", 20)),
                 some_NA = sample(0:5, 6),
                 some_NA_factor = sample(0:5, 6),
                 Group = c(rep(1, 20), rep(2, 20), rep(3, 20), rep(4, 20), rep(5, 20)),
                 Time = rep(seq(as.Date("2010-01-03"), length = 20, by = "1 month") - 1, 5),
                 norm = round(runif(100) / 10, 2),
                 Income = sample(100, 100),
                 Happiness = sample(10, 10),
                 Sex = round(rnorm(10, 0.75, 0.3), 2),
                 Age = round(rnorm(10, 0.75, 0.3), 2),
                 Educ = round(rnorm(10, 0.75, 0.3), 2))
DT[, uniqueID := .I]   # Creates a unique ID
DT[DT == 0] <- NA      # https://stackoverflow.com/questions/11036989/replace-all-0-values-to-na
DT$some_NA_factor <- factor(DT$some_NA_factor)
table(DT$some_NA_factor)

The normal subset syntax in lm could for example look as follows:

lm(Happiness ~ Income + some_NA_factor, data=DT, subset=(Income > 50 & Happiness < 5))

How do I adapt the syntax to check the number of observations per factor level?

I do not completely understand why you want to modify your data within the lm function. Why don't you first get your data correctly filtered and then apply the lm? — Koot6133, Aug 14 '19 at 14:24
Because I'm changing subsets all the time with very big datasets, which would, memory wise, make it very problematic to store each subset. — Tom, Aug 14 '19 at 14:26
Would it be too memory intensive to add a column with the number of obs in each subset? Like `DT$num<-ave(DT$uniqueID,DT$some_NA_factor,FUN=function(x)length(unique(x)))` — CrunchyTopping, Aug 14 '19 at 14:29

score 2 · Accepted Answer · answered Aug 14 '19 at 14:30

Consider building a boolean vector from your table call using Filter and isTRUE, and then use %in% on its names in the subset argument:

boolean_vec <- Filter(isTRUE, table(DT$some_NA_factor) > 16)
boolean_vec
#    1    2    4    5 
# TRUE TRUE TRUE TRUE 

lm(Happiness ~ Income + some_NA_factor, data=DT, 
   subset=(Income > 50 & Happiness < 5 & some_NA_factor %in% names(boolean_vec)))
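If the threshold x changes often, the same idea can be wrapped in a small helper. The sketch below is only an illustration: frequent_levels is a hypothetical name (not a base R function), and toy is a small made-up data frame rather than the DT from the question. Because lm evaluates subset inside data, the helper can be called directly in the lm call, so no filtered copy has to be stored.

```r
# Hypothetical helper (not part of base R): names of the factor levels
# that have more than `x` observations. table() ignores NA by default.
frequent_levels <- function(f, x) {
  counts <- table(f)
  names(counts)[counts > x]
}

# Small made-up data set for illustration (not the DT from the question).
toy <- data.frame(
  y = c(1, 3, 2, 5, 4, 6, 8, 7, 9),
  x = c(2, 1, 4, 3, 6, 5, 7, 9, 8),
  g = factor(c("a", "a", "a", "a", "b", "b", "b", "b", "c"))
)

# Levels of g with more than 3 rows: "a" and "b" ("c" has only one row).
keep <- frequent_levels(toy$g, 3)

# The helper is evaluated inside the subset argument; rows of level "c"
# are left out and lm drops the now-unused level when building the model.
fit <- lm(y ~ x + g, data = toy, subset = g %in% frequent_levels(g, 3))
nobs(fit)
```

Additional conditions such as Income > 50 can be combined with & in the same subset expression, as in the call above.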

score 1 · Answer 2 · answered Aug 14 '19 at 14:34

Or use the %>% pipe operator (available via dplyr), so you do not have to store each subset separately:

library(dplyr)
DT %>%
  filter(!is.na(some_NA_factor)) %>%
  count(some_NA_factor) %>%
  filter(n > 16) %>%
  inner_join(DT, by = 'some_NA_factor') %>%
  lm(Happiness ~ Income + some_NA_factor, data = .)
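The count-then-filter idea can also be written without a join, by attaching each row's group size as a column and filtering on it in subset, along the lines of the ave suggestion in the comments. This is a rough base R sketch on a made-up data frame; toy and n_obs are illustrative names, and the toy factor has no NA values (check how NA levels should be treated before applying it to real data).

```r
# Small made-up data set for illustration (not the DT from the question).
toy <- data.frame(
  y = c(1, 3, 2, 5, 4, 6, 8, 7, 9),
  x = c(2, 1, 4, 3, 6, 5, 7, 9, 8),
  g = factor(c("a", "a", "a", "a", "b", "b", "b", "b", "c"))
)

# For every row, the number of rows that share its level of g.
toy$n_obs <- ave(seq_len(nrow(toy)), toy$g, FUN = length)

# Keep only rows whose level has more than 3 observations.
fit2 <- lm(y ~ x + g, data = toy, subset = n_obs > 3)
nobs(fit2)
```

The extra column costs one integer per row, which is usually far cheaper than storing a filtered copy of the data for every threshold.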
