Apologies if this is something a more seasoned R user would know, but I just came across this and wanted to ask about proper usage.

It appears to be possible to classify ranges for variables by using as.factor, so I could group observations into ranges. For example, if I were looking at visits by user, it looks like I could write an if/then statement to bin the users by the range of visits they had, then get summary statistics for each group.

Here is the link where I learned about this: http://programming-r-pro-bro.blogspot.com/2011/10/modelling-with-r-part-2.html

Now, while this approach looks easier than grouping data with plyr and ddply, it does not seem powerful enough to break the variable into X bins (for example, 10 for deciles). You would have to do that yourself.

This leads to my question: is one better than the other for grouping data, or are there just many ways to tackle grouping like this?

Thanks

joran
mikebmassey
  • 2
    `as.factor` simply converts a character vector into a factor - it does no analysis by itself. `ddply` is one of the powerful tools in the suite provided by `plyr`. Comparing `as.factor` to `ddply` is a bit like comparing a ball bearing to a gearbox. – Andrie Oct 31 '11 at 17:40
  • 2
    You might want to take a look at `?cut`. You might also want to take a look at http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example With a small example of what you want to do, people will more easily chime in and show you some easier ways to do it. `cut()` is one of them. – Joris Meys Oct 31 '11 at 17:42
  • 1
    Sorry @Joris, I really should let people improve their question before answering. – Aaron left Stack Overflow Oct 31 '11 at 17:44

1 Answer

I think cut is a better tool for this.

With some sample data:

set.seed(123)
age <- round(runif(10,20,50))

This is what I'd do:

> cut(age, c(0,30,40,Inf))
 [1] (0,30]   (40,Inf] (30,40]  (40,Inf] (40,Inf] (0,30]   (30,40]  (40,Inf]
 [9] (30,40]  (30,40] 
Levels: (0,30] (30,40] (40,Inf]

Optionally, setting the factor labels manually:

> cut(age, c(0,30,40,Inf), labels=c('0-30', '31-40', '40+'))
 [1] 0-30  40+   31-40 40+   40+   0-30  31-40 40+   31-40 31-40
Levels: 0-30 31-40 40+
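
Once the bins are a factor, per-group summary statistics follow directly in base R. A minimal sketch, assuming a hypothetical `visits` vector (invented here, not part of the question's data) with one value per entry of `age`:

```r
set.seed(123)
age <- round(runif(10, 20, 50))

# Hypothetical visit counts, one per user; invented for illustration only
visits <- c(3, 12, 7, 1, 9, 4, 15, 2, 8, 6)

# Same binning as above, stored as a factor
age_group <- cut(age, c(0, 30, 40, Inf), labels = c('0-30', '31-40', '40+'))

# Mean visits within each age bin
mean_visits <- tapply(visits, age_group, mean)

# Number of users in each bin
bin_counts <- table(age_group)

print(mean_visits)
print(bin_counts)
```

The same factor could also be passed as the grouping variable to `ddply` or `aggregate`; `tapply` is used here only to keep the sketch free of extra packages.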

To contrast, the linked page suggests this:

> as.factor(ifelse(age<=30, '0-30', ifelse(age <= 40, '30-40', '40+')))
 [1] 0-30  40+   30-40 40+   40+   0-30  30-40 40+   30-40 30-40
Levels: 0-30 30-40 40+
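
On the question's point about deciles: `cut` can be combined with `quantile` to build equal-count bins without writing the breaks by hand. A minimal sketch, assuming a hypothetical larger sample `x` (invented here so that ten bins make sense). Note that `include.lowest = TRUE` is needed, otherwise the minimum value falls outside the first interval and becomes NA:

```r
set.seed(123)

# Hypothetical sample of 100 values; invented for illustration only
x <- runif(100, 20, 50)

# Break points at the 0%, 10%, ..., 100% quantiles of x
breaks <- quantile(x, probs = seq(0, 1, by = 0.1))

# include.lowest = TRUE closes the first interval on the left,
# so the minimum of x is assigned to bin 1 instead of NA
decile <- cut(x, breaks, include.lowest = TRUE, labels = 1:10)

# Each decile should hold about a tenth of the observations
decile_counts <- table(decile)
print(decile_counts)
```

With heavily tied data the quantile breaks may not be unique, in which case `cut` stops with an error, so this sketch assumes a reasonably continuous variable.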
Aaron left Stack Overflow
  • 1
    'cut' is definitely better than that ifelse approach illustrated in the linked page. Be aware of the include.lowest argument to 'cut'. – IRTFM Oct 31 '11 at 20:20
  • Thanks for the suggestions. I see why this is probably a better way to approach it. – mikebmassey Nov 01 '11 at 00:02