"Subset" in R does not subset the way I want it to

Question

Possible Duplicate:
dropping factor levels in a subsetted data frame in R

I am getting a little frustrated with R here; it would be great if anyone could help me with the following: I am trying to pull a subset out of my dataset, but it does not work properly.

Specifics: I have a spreadsheet with words and different features associated with each word e.g. word article length ... ... Now I am trying to look at individual words, e.g. pull out all instances where the word is "hairbrush". To do so, I tried:

hairbrush <- subset(dataset, word == "hairbrush")

This seems to work fine and gives me the right dataset when I look at it with fix or head. However, as soon as I try to do things like xtabs or any kind of computation, I do not get very far because all the other words are still "there" and mess up my stats. E.g. when I do levels, it gives me "hairbrush", but also all other 200 words. All the data pertaining to these "hidden words" is NA but it still messes up my stats.

Is that the usual behavior of subset? Or am I doing something wrong? Or is this the wrong approach?

Oh, and in some similar questions on Google, people always asked for the output of str, so here it is:

> str(hairbrush)
'data.frame':   41 obs. of  10 variables:
 $ id       : Factor w/ 1352 levels "1-1-1-11-a.eaf",..: 210 240 267 295 320 351 378 403 427 452 ...
 $ speaker  : num  24 25 26 28 29 30 32 33 34 35 ...
 $ loc      : Factor w/ 2 levels "nb","xx": 1 1 1 1 1 1 1 1 1 1 ...
 $ gilbertno: Factor w/ 27 levels "1","10","108",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ tword    : Factor w/ 65 levels "abaddream","afuneral",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ word     : Factor w/ 228 levels "abbe","aepfel",..: 164 93 99 93 92 100 94 94 28 93 ...
 $ loan     : Factor w/ 5 levels "FILE","maybe",..: 4 3 5 3 5 5 3 3 3 3 ...
 $ article  : Factor w/ 40 levels "a","das","dat",..: 34 34 33 33 34 34 34 34 13 34 ...
 $ gender   : Factor w/ 13 levels "a","af","amn",..: 11 11 7 7 11 11 11 11 7 11 ...
 $ comment  : Factor w/ 4 levels "0","die macht ja vorschlaege",..: 1 1 1 1 1 1 1 1 1 1 ...

score 4 · Accepted Answer · answered Nov 24 '12 at 15:33

You need to use droplevels after subsetting to clean out unused levels.
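
As a minimal sketch of that suggestion: the toy data frame below is a hypothetical stand-in for the asker's `dataset` (only the column names `word` and `article` are taken from the question; the values are made up). It shows the unused levels surviving `subset` and disappearing after `droplevels`.

```r
# Hypothetical toy data standing in for the asker's `dataset`
dataset <- data.frame(
  word    = factor(c("hairbrush", "apple", "hairbrush", "table")),
  article = factor(c("die", "der", "die", "der"))
)

# Subsetting keeps only the matching rows ...
hairbrush <- subset(dataset, word == "hairbrush")

# ... but the factor still remembers every level of the original column
levels_before <- levels(hairbrush$word)

# droplevels() on a data frame removes unused levels from all factor columns
hairbrush <- droplevels(hairbrush)
levels_after <- levels(hairbrush$word)

print(levels_before)
print(levels_after)
```

Calling `droplevels` on the whole data frame, rather than on one column, also cleans up the other factor columns (here `article`), which is usually what you want before `xtabs` or similar tabulations.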

Ben Bolker

thanks so much ben, that did the trick! can't believe that is not in my textbook... – patrick Nov 24 '12 at 15:41

score 3 · Answer 2 · answered Nov 24 '12 at 15:35

subset is working as intended. The problem you are having is due to word being a factor. When you subset the data.frame, subset doesn't redefine your variables, so word continues to carry with it all of the level information that was part of the original dataset. Try using droplevels to drop all of the unused levels from your data.frame.
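
To illustrate why the extra words still show up in tabulations, here is a small sketch using a made-up factor (the values are hypothetical, not the asker's data). A factor stores its full set of levels as an attribute, and indexing the factor leaves that attribute untouched, so `table` reports a zero count for every unused level.

```r
# Hypothetical toy factor; the values are invented for illustration
word <- factor(c("hairbrush", "apple", "hairbrush", "table"))

# Indexing keeps the original levels attribute
kept <- word[word == "hairbrush"]
counts_before <- table(kept)        # includes zero counts for unused levels

# Either droplevels() or re-calling factor() rebuilds the levels
kept_clean   <- droplevels(kept)
kept_rebuilt <- factor(kept)
counts_after <- table(kept_clean)   # only the level that actually occurs

print(counts_before)
print(counts_after)
```

For a single column, `factor(kept)` and `droplevels(kept)` give the same result here; `droplevels` is the more convenient choice when a whole data frame needs cleaning.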
