I'm working with the KDD Cup 2010 data (https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp). In R, how can I remove the rows that belong to factor levels with a low total number of instances?
I've tried the following. First, I create a table of counts for the student name factor:
studenttable <- table(data$Anon.Student.Id)
which returns a table of row counts per student (first few entries shown):
l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72
      9890       7989       7665       7242       6928       6651
Then I can get a logical table that tells me whether there are more than 1000 data points for a given factor level:
biginstances <- studenttable>1000
Then I tried subsetting the data using this table:
bigdata <- subset(data, (biginstances[Anon.Student.Id]))
But I get weird subsets that still have the same number of factor levels as the full set. I'm simply interested in removing the rows whose factor level isn't well represented in the dataset.
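To make the problem easier to reproduce, here is a minimal sketch on a small made-up data frame. The data frame `toy`, the ids "a", "b", "c", the column `x`, and the threshold of 2 are all invented for illustration and are not taken from the KDD data; only the three-step pattern is the same as in my attempt.

```r
# Toy stand-in for the KDD data; ids, values and threshold are made up.
toy <- data.frame(
  Anon.Student.Id = factor(c("a", "a", "a", "b", "c", "c", "c", "c")),
  x = 1:8
)

# Count the rows per student, then flag students with more than 2 rows
# (2 plays the role of 1000 in the real data).
studenttable <- table(toy$Anon.Student.Id)
biginstances <- studenttable > 2

# Same subsetting pattern as in my attempt on the full data.
bigdata <- subset(toy, biginstances[Anon.Student.Id])

# Number of rows kept, and number of levels the factor still reports.
nrow(bigdata)
nlevels(bigdata$Anon.Student.Id)
levels(bigdata$Anon.Student.Id)
```

In this toy case the rows for "b" are gone, but "b" is still listed among the levels of the factor. What I would like is a result in which only "a" and "c" remain as levels.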