Following up on a previous question of mine, which @Andrie answered, I have a question about using the cut function and its labels.

I'd like to get summary statistics based on the range of the number of times a user logs in.

Here is my data:

  # Get random numbers
  NumLogin <- round(runif(100,1,50))

  # Set the login range     
  LoginRange <- cut(NumLogin, 
       c(0,1,3,5,10,15,20,Inf), 
       labels=c('1','2','3-5','6-10','11-15','16-20','20+')
       )

Now I have my LoginRange, but I'm unsure how the cut function actually works. I want to find users who have logged in 1 time, 2 times, 3-5 times, etc., while only including a user if they fall in that range. Is the cut function including 3 twice (in the 2 bucket and the 3-5 bucket)? In my example, I can see a user who logged in 3 times, but they are cut as '2'. I've looked at the documentation and every R book I own, but no luck. What am I doing wrong?

Also - As a usage question - should I attach the LoginRange to my data frame? If so, what's the best way to do so?

DF <- data.frame(NumLogin, LoginRange)

?

Thanks

wibeasley
mikebmassey
  • Re: your final question, if you have a pre-existing data.frame `DF`, you can attach the `LoginRange` to it by doing `DF$LoginRange <- LoginRange`. Whether you want to do that is up to you. Is that what you were asking? – Josh O'Brien Nov 22 '11 at 22:09

1 Answer

The intervals defined by the cut() function are (by default) closed on the right. To see what that means, try this:

cut(1:2, breaks=c(0,1,2))
# [1] (0,1] (1,2]

As you can see, the integer 1 gets included in the range (0,1], not in the range (1,2]. It doesn't get double-counted, and for any input value falling outside of the bins you define, cut() will return a value of NA.
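To make the boundary behaviour concrete, here is a minimal sketch in base R. The vector `x` is made up for illustration and is not part of the question's data; it compares the default right-closed bins with `right = FALSE`, and shows where `NA` appears.

```r
# Illustrative input (an assumption, not from the question)
x <- c(0, 1, 2, 3)

# Default (right = TRUE): bins are (0,1], (1,2], (2,3]
right_closed <- cut(x, breaks = c(0, 1, 2, 3))
as.character(right_closed)
# 0 falls outside every bin, so it becomes NA:
# NA "(0,1]" "(1,2]" "(2,3]"

# right = FALSE: bins are [0,1), [1,2), [2,3)
left_closed <- cut(x, breaks = c(0, 1, 2, 3), right = FALSE)
as.character(left_closed)
# Now 3 is the value left out:
# "[0,1)" "[1,2)" "[2,3)" NA
```

Either way, each value lands in at most one bin; only the side on which a boundary value is counted changes.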

When dealing with integer-valued data, I tend to set break points between the integers, just to avoid tripping myself up. In fact, doing this with your data (as shown below) reveals that the 2nd and 3rd bins were actually incorrectly named: with breaks at 1, 3 and 5 they hold the values 2-3 and 4-5, not 2 and 3-5. That illustrates the point quite nicely!

LoginRange <- cut(NumLogin, 
   c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
   # c(0,1,3,5,10,15,20,Inf) + 0.5, 
   labels=c('1','2-3','4-5','6-10','11-15','16-20','20+')
   )
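One way to check that the labels match the bin contents, and to attach the result to a data frame, is sketched below. It assumes a deterministic stand-in for `NumLogin` (each count from 1 to 25 once) instead of the random draw in the question, so that the output is reproducible.

```r
# Illustrative stand-in for NumLogin (an assumption): counts 1 to 25, once each
NumLogin <- 1:25

LoginRange <- cut(NumLogin,
   c(0.5, 1.5, 3.5, 5.5, 10.5, 15.5, 20.5, Inf),
   labels = c('1', '2-3', '4-5', '6-10', '11-15', '16-20', '20+'))

# Smallest and largest login count that landed in each bin;
# these should agree with the bin's label
tapply(NumLogin, LoginRange, range)

# Attach the factor to a data frame and count users per range
DF <- data.frame(NumLogin, LoginRange)
table(DF$LoginRange)
```

With the factor stored as a column, per-range summaries can be computed with the usual grouping tools, for example `aggregate(NumLogin ~ LoginRange, data = DF, FUN = mean)`.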
Josh O'Brien
  • I prefer the intervals to be closed on the left, so use cut2 from the Hmisc package. (Easier than always typing `,right=FALSE,`.) – IRTFM Nov 22 '11 at 21:30
  • @DWin -- Thanks for the pointer. It looks like `cut2` can do a lot of other interesting things besides, like cutting into quantile groups, or setting breakpoints to ensure a minimum number of observations per group. – Josh O'Brien Nov 22 '11 at 21:34
  • Yeah, right, ... I should have added also easier than typing `breaks= quantile( varname, probs=(0:10)/10, na.rm=TRUE )` and forgetting the argument names and `na.rm` the first two times. – IRTFM Nov 22 '11 at 22:26
  • `cut_number` and `cut_interval` from `ggplot2` are also useful shortcuts that cover similar ground – Ben Bolker Nov 23 '11 at 01:22