
I apologize in advance if this is a beginner question, but I searched the forum and couldn't find anything matching what I am trying to do. I have a training set and I am trying to reduce the number of levels in my categorical variables (in the example below, the categorical variable is the state). I would like to map each state to the mean, or rate, of the class variable for that level. Once loaded into a data frame, my training set looks like this:

    state class mean
1      CA     1    0
2      AZ     1    0
3      NY     0    0
4      CA     0    0
5      NY     0    0
6      AZ     0    0
7      AZ     1    0
8      AZ     0    0
9      CA     0    0
10     VA     1    0

I would like the third column of my data frame to hold the mean of the class variable within each state, so the mean for the CA rows would be 0.333, and so on. The mean column could then be used as a replacement for the state column. Is there a good way of doing this in R without writing an explicit loop?

How does one go about mapping new levels (for example, new states) if my training set didn't include them? Any links to approaches in R would be greatly appreciated.

Stu Thompson
ak3nat0n

2 Answers


This is really what the ave function was designed for. It can be used to construct any functional result by category, but its default function is mean, hence the name, i.e., ave(rage):

    dfrm$mean <- with(dfrm, ave(class, state))  # FUN = mean is the default
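As a quick check against the sample data in the question, here is a minimal sketch. The data frame is rebuilt by hand from the table above (without the placeholder mean column), and the name dfrm simply follows the line above.

```r
# Rebuild the sample data from the question; dfrm is the name used in this answer.
dfrm <- data.frame(
  state = c("CA", "AZ", "NY", "CA", "NY", "AZ", "AZ", "AZ", "CA", "VA"),
  class = c(1, 1, 0, 0, 0, 0, 1, 0, 0, 1)
)

# ave() returns a vector as long as the input, with each element replaced
# by the mean of class within that row's state (FUN = mean by default).
dfrm$mean <- with(dfrm, ave(class, state))

# Every CA row now carries 1/3, every AZ row 0.5, NY 0 and VA 1.
print(dfrm)
```

Because ave() keeps the original row order, the result can be assigned straight back as a column with no join step.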
IRTFM
  • I accepted this answer because it doesn't require me to use an external package. Could you please reverse state and class in your answer? ex: with(dfrm,ave(class,state)) – ak3nat0n Jan 07 '12 at 01:11
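    library(plyr)
    # Summarise the mean of class per state, then left-join it back onto the original rows
    means <- ddply(data, .(state), summarise, mean = mean(class))
    join(data, means, by = "state", type = "left")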
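A per-state summary table like this one also gives a possible handle on the second part of the question (states that never appeared in training). The sketch below uses base R only. The names train, rates and new_data are made up for illustration, and falling back to the overall training mean for unseen states is an assumption on my part, not something established in this thread.

```r
# Training data rebuilt from the question's table.
train <- data.frame(
  state = c("CA", "AZ", "NY", "CA", "NY", "AZ", "AZ", "AZ", "CA", "VA"),
  class = c(1, 1, 0, 0, 0, 0, 1, 0, 0, 1)
)

# One row per state with its mean class (base R counterpart of the ddply summary).
rates <- aggregate(class ~ state, data = train, FUN = mean)
names(rates)[2] <- "mean"

# Hypothetical new data containing a state ("TX") that is absent from training.
new_data <- data.frame(state = c("CA", "TX", "AZ"))

# match() yields NA for unseen states. ASSUMPTION: those rows fall back to
# the overall training mean; another default may suit your problem better.
idx <- match(new_data$state, rates$state)
new_data$mean <- ifelse(is.na(idx), mean(train$class), rates$mean[idx])

print(new_data)
```

Keeping the lookup table separate from the training rows means the same mapping can be applied to any later data set without recomputing it.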
Maiasaura
  • I think it may be simpler to just use `ddply` and `transform` (if I've understood the OP correctly). – joran Jan 04 '12 at 23:44
  • Actually I just did a summary but matched it back to the original data. I suspect the ddply statement alone is sufficient, but the OP might want it as part of the original data. – Maiasaura Jan 05 '12 at 02:17