
I'm working with the KDD 2010 data (https://pslcdatashop.web.cmu.edu/KDDCup/downloads.jsp). In R, how can I remove rows whose factor level has a low total number of instances?

I've tried the following. First, create a table of counts for the student name factor:

studenttable <- table(data$Anon.Student.Id)

which returns a table:

l5eh0S53tB Qwq8d0du28 tyU2s0MBzm dvG32rxRzQ i8f2gg51r5 XL0eQIoG72 
  9890       7989       7665       7242       6928       6651 

Then I can get a logical vector that tells me whether there are more than 1000 data points for a given factor level:

biginstances <- studenttable>1000

Then I tried subsetting the data on this condition:

bigdata <- subset(data, (biginstances[Anon.Student.Id]))

But I get weird subsets that still have the original number of factor levels as the full set. I simply want to remove the rows whose factor level isn't well represented in the dataset.

Harry Moreno

4 Answers


There are probably more efficient ways to do this, but this should get you what you want. I didn't use the names you used, but you should be able to follow the logic just fine (hopefully!).

# Create some fake data
dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
# tabulate the id variable
tab <- table(dat$id)
# Get the names of the ids that we care about.
# In this case the ids that occur >= 3 times
idx <- names(tab)[tab >=3]
# Only look at the data that we care about
dat[dat$id %in% idx,]
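
Translated to the question's data (assuming the column names from the original post), the same logic would be a sketch like:

# Same approach with the OP's variable names (untested sketch)
studenttable <- table(data$Anon.Student.Id)
idx <- names(studenttable)[studenttable > 1000]
bigdata <- data[data$Anon.Student.Id %in% idx, ]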
Dason

@Dason gave you some good code to work with as a starting point. I'm going to try to explain why (I think) what you tried didn't work.

biginstances <- studenttable>1000

This will create a logical vector whose length is equal to the number of unique student ids: studenttable contains one count for each unique value of data$Anon.Student.Id. When you try to use that logical vector in subset:

bigdata <- subset(data, (biginstances[Anon.Student.Id]))

its length is almost surely much less than the number of rows in data. And since the subsetting criterion in subset is meant to identify rows of data, R's recycling rules take over and you get 'weird'-looking subsets.
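
As an aside, one way to salvage the original approach is to index the named logical vector by each row's id. Indexing by name expands it to one entry per row, so nothing gets recycled. A sketch on fake data like Dason's:

dat <- data.frame(id = rep(letters[1:5], 1:5), y = rnorm(15))
big <- table(dat$id) >= 3   # named logical, one entry per level
# Indexing by each row's id yields one TRUE/FALSE per row,
# so subset() no longer needs to recycle anything
subset(dat, big[as.character(id)])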

I would also add that taking subsets to remove rare factor levels will not change the levels attribute of the factor. In other words, you'll get a factor back with no instances of that level, but all of the original factor levels will remain in the levels attribute. For example:

> fac <- factor(rep(letters[1:3],each = 3))
> fac
[1] a a a b b b c c c
Levels: a b c
> fac[-(1:3)]
[1] b b b c c c
Levels: a b c
> droplevels(fac[-(1:3)])
[1] b b b c c c
Levels: b c

So you'll want to use droplevels if you want to ensure that those levels are really 'gone'. Also, see options(stringsAsFactors = FALSE).

joran

Another approach involves a join between your dataset and the table of counts. I'll use plyr for my purposes, but it can also be done using base functions (like merge and as.data.frame.table).

require(plyr)

set.seed(123)
Data <- data.frame(var1 = sample(LETTERS[1:5], size = 100, replace = TRUE),
                   var2 = 1:100)


R> table(Data$var1)

 A  B  C  D  E 
19 20 21 22 18 


## count rows per category, to later drop categories with ~20 rows or fewer

mytable <- count(Data, vars = "var1")

## mytable <- as.data.frame(table(Data$var1))

R> str(mytable)
'data.frame':   5 obs. of  2 variables:
 $ var1: Factor w/ 5 levels "A","B","C","D",..: 1 2 3 4 5
 $ freq: int  19 20 21 22 18

Data <- join(Data, mytable)

## Data <- merge(Data, mytable)

R> str(Data)
'data.frame':   100 obs. of  3 variables:
 $ var1: Factor w/ 5 levels "A","B","C","D",..: 3 2 3 5 3 5 5 4 3 1 ...
 $ var2: int  1 2 3 4 5 6 7 8 9 10 ...
 $ freq: int  21 20 21 18 21 18 18 22 21 19 ...



mysubset <- droplevels(subset(Data, freq > 20))

R> table(mysubset$var1)

 C  D 
21 22 

Hope this helps.

dickoa

This is how I managed to do it. I sorted the table of factor levels and their associated counts:

studenttable <- sort(studenttable, decreasing=TRUE)

Now that it's in order, we can use index ranges sensibly. So I got the number of factor levels that are represented more than 1000 times in the data:

sum(studenttable > 1000)
## 230
sum(studenttable < 1000)
## 344
## 344 + 230 = 574

Now we know the first 230 factor levels are the ones we care about, so we can do

idx <- names(studenttable[1:230])
bigdata <- data[data$Anon.Student.Id %in% idx,]

We can verify it worked by doing

bigstudenttable <- table(bigdata$Anon.Student.Id)

to get a printout and see that all the factor levels with fewer than 1000 instances now have 0 rows.

Harry Moreno
  • That's pretty much how I would have done it. You may find, however, that if these are really factor variables that you may want to remove the extranous levels with `bigdata$Anon.Student.Id <- factor(bigdata$Anon.Student.Id)` – IRTFM Nov 07 '11 at 15:51
  • I just noticed that Manuel Ramon's answer to http://stackoverflow.com/questions/8036700/removing-repeated-obs-data-if-n-obs-x-in-r would be a very succinct way of doing this. – IRTFM Nov 07 '11 at 16:36