0

I want to extract a set of rows of an existing dataset:

 dataset.x <- dataset[(as.character(dataset$type))=="x",]

However, when I run

   summary(dataset.x$type)

it displays all the types that were present in the original dataset. Basically, I get a result like this:

   x 12354235    #the correct itemcount
   y 0
   z 0
   a 0
   ...

Not only are the zero-count entries ugly, but they also mess up any plot of dataset.x because of the hundreds of entries with the value 0.

Gavin Simpson
Chris

4 Answers

3

I'm assuming this is a factor? If so, droplevels() can be used: http://stat.ethz.ch/R-manual/R-patched/library/base/html/droplevels.html

If you add a small reproducible example, it will help others get on the same page and give better advice if this isn't right.
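
As a minimal sketch of what that could look like, assuming `type` really is a factor: the toy `dataset` below and its values are made up for illustration and are not the asker's data.

```r
# Toy data standing in for the asker's `dataset` (hypothetical values).
dataset <- data.frame(
  type  = factor(c("x", "y", "x", "z", "x")),
  value = 1:5
)

# Subsetting rows keeps every level of the original factor.
dataset.x <- dataset[dataset$type == "x", ]
levels(dataset.x$type)   # still lists x, y and z

# droplevels() discards the levels that no longer occur in the subset.
dataset.x <- droplevels(dataset.x)
levels(dataset.x$type)
summary(dataset.x$type)
```

After the `droplevels()` call, `summary()` and plots should only show the levels that are actually present in the subset.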

Chase
  • You don't need `gdata` anymore, I think. `droplevels` was added recently, not sure which version. – joran Jun 16 '11 at 18:53
3

Building on Chase's answer, subsetting and dropping unused levels in factors comes up a lot, so it pays to just create your own function by combining droplevels and subset:

subsetDrop <- function(...){droplevels(subset(...))}
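
A short usage sketch, assuming a toy `dataset` with a factor column `type` (both made up for illustration). The arguments are passed straight through to `subset()`, so the condition is written the same way as in a direct `subset()` call.

```r
# Wrapper from above: subset first, then drop unused factor levels.
subsetDrop <- function(...) {
  droplevels(subset(...))
}

# Hypothetical example data.
dataset <- data.frame(
  type  = factor(c("x", "y", "x", "z")),
  value = 1:4
)

# Keep only the rows where type is "x"; unused levels are dropped.
dataset.x <- subsetDrop(dataset, type == "x")
table(dataset.x$type)
```

Because `subset()` uses non-standard evaluation, a wrapper like this is probably best kept for interactive use rather than called from inside other functions.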
joran
  • If you're using that function regularly, it's probably a sign you want a character vector, not a factor. – hadley Jun 18 '11 at 04:02
  • @hadley - Indeed, I live mostly with stringsAsFactors=FALSE. However, I happen to often want things ordered non-alphabetically when I plot them without dragging all the levels along for the ride. – joran Jun 18 '11 at 04:29
  • I wish there was a datatype that preserved order but didn't preserve levels. – hadley Jun 18 '11 at 12:54
3

Others have explained what is happening and how to fix it; I just want to show why keeping the unused levels is a desirable default.

Consider the following sample code:

mydata <- data.frame( 
    x = factor( rep( c(0:5,0:5), c(0,5,10,20,10,5,5,10,20,10,5,0))),
    sex = rep( c('F','M'), each=50 ) )

mydata.males <- mydata[ mydata$sex=='M', ]
mydata.males.dropped <- droplevels(mydata.males)

mydata.females <- mydata[ mydata$sex=='F', ]
mydata.females.dropped <- droplevels(mydata.females)

par(mfcol=c(2,2))
barplot(table(mydata.males$x), main='Male', sub='Default')
barplot(table(mydata.females$x), main='Female', sub='Default')

barplot(table(mydata.males.dropped$x), main='Male', sub='Drop')
barplot(table(mydata.females.dropped$x), main='Female', sub='Drop')

Which produces this plot:

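[Figure: 2×2 grid of bar plots. Left column: 'Male' and 'Female' with all levels kept ('Default'). Right column: 'Male' and 'Female' after droplevels ('Drop').]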

Now, which is the more meaningful comparison: the two plots on the left, or the two plots on the right?

Instead of dropping unused levels it may be better to rethink what you are doing. If the main goal is to get the count of the x's then you can use sum rather than subsetting and getting the summary. And how meaningful can a plot be on a variable that you have already forced to be a single value?
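
For the counting case, a minimal sketch might look like this (the toy `dataset` is an assumption for illustration only):

```r
# Hypothetical example data with a factor column.
dataset <- data.frame(type = factor(c("x", "y", "x", "z", "x")))

# Count the rows of type "x" directly, without subsetting first.
# Add na.rm = TRUE if the column may contain missing values.
n_x <- sum(dataset$type == "x")
n_x

# table() gives the counts for all types at once.
table(dataset$type)
```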

Greg Snow
1

Try

dataset$type <- as.character(dataset$type)

followed by your original code. It's probably just that R is still treating that column as a factor and is keeping all of the information about that factor in the column.
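
A small sketch of that approach on made-up data (the `dataset` below is hypothetical). One thing to keep in mind: `summary()` on a character vector reports length, class and mode rather than counts, so `table()` is used here to get the per-type counts.

```r
# Hypothetical example data with a factor column.
dataset <- data.frame(
  type  = factor(c("x", "y", "x", "z")),
  value = 1:4
)

# Convert the factor to plain character strings; no levels are carried along.
dataset$type <- as.character(dataset$type)

# Subset as in the question.
dataset.x <- dataset[dataset$type == "x", ]

# Only the values actually present are counted.
table(dataset.x$type)
```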

Rguy