I am trying to understand how cut divides data and creates intervals. I read ?cut but could not figure out how cut works in R.
Here is my problem:

set.seed(111)
data1 <- seq(1,10, by=1)
data1 
[1]  1  2  3  4  5  6  7  8  9 10
data1cut<- cut(data1, breaks = c(0,1,2,3,5,7,8,10), labels = FALSE)
data1cut
[1] 1 2 3 4 4 5 5 6 7 7

1. Why are 8, 9 and 10 not included in the data1cut result?
2. Why do summary(data1) and summary(data1cut) produce different results?

summary(data1)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.00    3.25    5.50    5.50    7.75   10.00 

summary(data1cut)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
1.00    3.25    4.50    4.40    5.75    7.00  

How should I use cut to create, say, 4 bins based on the results of summary(data1)?

bin1 [1, 3.25]
bin2 (3.25, 5.50]
bin3 (5.50, 7.75]
bin4 (7.75, 10]

Thank you.

epo3
deepseefan
  • Why do you think some of the values are not included? What did you expect as the `cut` result? Maybe try `cut` on values that are not the first integers to avoid confusion, and read carefully the paragraph **Value** from `?cut`: *A factor is returned, unless labels = FALSE which results in an integer vector of level codes.* – Cath Aug 24 '16 at 12:45
  • Maybe my understanding of `cut` is very limited. What I expected from `cut` was a set of bins for the ranges I created, with the corresponding values (factors) in those bins. So I assumed that distribution metrics such as `summary(data1cut)` and `summary(data1)` would be similar? – deepseefan Aug 24 '16 at 13:31
  • What cut does is indeed put your data into bins and, for each of your vector values, it gives the "code" of the associated bin. You can do `table(data1, data1cut)` to better understand which value falls in which bin – Cath Aug 24 '16 at 13:48
  • Thank you, it is making sense now; but if we do a `boxplot(data1)` and a `boxplot(data1cut)`, we get different quartile and median visualizations. How should one go about justifying that the two plots are the same (assuming they mean the same thing)? – deepseefan Aug 24 '16 at 14:11
  • No justification needed, they are not the same. You should use `labels=paste0("bin", 1:7)` in the `cut` call, it may make it clearer to you. By the way, there is absolutely no need for a `set.seed` call here – Cath Aug 24 '16 at 14:15

1 Answer

In your example, cut splits the vector into the following intervals: (0,1] (1); (1,2] (2); (2,3] (3); (3,5] (4); (5,7] (5); (7,8] (6); (8,10] (7)

The numbers in brackets are the integer codes cut assigns to each bin, numbered in the order of the breaks values provided.

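By default, cut builds intervals that are open on the left and closed on the right, so each bin excludes its lower bound and includes its upper bound. A value equal to the lowest break therefore falls outside every bin and becomes NA. To put it in the first bin, set include.lowest = TRUE.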
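To make the boundary behaviour concrete, here is a small sketch comparing the default call with include.lowest = TRUE and right = FALSE. The vector x and the breaks brks are made up for illustration and are not taken from the question.

```r
# Hypothetical values chosen so that every value sits exactly on a break
x <- c(1, 3, 5, 10)
brks <- c(1, 3, 5, 10)

# Default: intervals are (1,3], (3,5], (5,10]; 1 is outside every bin -> NA
default_cut <- cut(x, breaks = brks)

# include.lowest = TRUE closes the first interval: [1,3], (3,5], (5,10]
lowest_cut <- cut(x, breaks = brks, include.lowest = TRUE)

# right = FALSE flips the closed side: [1,3), [3,5), [5,10); now 10 -> NA
left_cut <- cut(x, breaks = brks, right = FALSE)

print(data.frame(x, default_cut, lowest_cut, left_cut))
```

Note that include.lowest only affects the single outermost break value; the interior boundaries are still controlled by right.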

  1. Nothing is missing. You passed labels = FALSE, so cut returns an integer vector of level codes (the numbers in brackets above) instead of a factor. 8 falls in bin 6, and 9 and 10 fall in bin 7.

  2. summary(data1) summarises the raw data, while summary(data1cut) summarises the bin codes, so the two are not expected to match.
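One way to check both points is to put the values, their integer codes, and the interval each code stands for side by side. This is only a sketch; the names brks, codes and bins are introduced here for illustration.

```r
data1 <- 1:10
brks <- c(0, 1, 2, 3, 5, 7, 8, 10)

codes <- cut(data1, breaks = brks, labels = FALSE)  # integer bin codes
bins  <- cut(data1, breaks = brks)                  # factor with interval labels

# One row per value: the code and the interval that code refers to
print(data.frame(value = data1, code = codes, bin = bins))

# Number of values per bin; all 10 values are accounted for
print(table(bins))
```

The codes column reproduces data1cut from the question, and the bin column shows that 8 sits in (7,8] while 9 and 10 sit in (8,10].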

You can get the split you need using:

data2cut<- 
  cut(data1, breaks = c(1, 3.25, 5.50, 7.75, 10),
      labels = c("1-3.25", "3.25-5.50", "5.50-7.75", "7.75-10"),
      include.lowest = TRUE)

The result is the following:

> data2cut

 [1] 1-3.25    1-3.25    1-3.25    3.25-5.50 3.25-5.50 5.50-7.75 5.50-7.75 7.75-10   7.75-10  
[10] 7.75-10  
Levels: 1-3.25 3.25-5.50 5.50-7.75 7.75-10
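If you would rather not type the quartiles by hand, the breaks can be computed with quantile(), whose default probabilities give the minimum, the three quartiles and the maximum. The following is a sketch of that variant, not part of the original answer; data3cut is a name made up here.

```r
data1 <- 1:10

# Default probs are 0, 0.25, 0.5, 0.75, 1: min, quartiles, max
brks <- quantile(data1)

# include.lowest = TRUE keeps the minimum value in the first bin
data3cut <- cut(data1, breaks = brks, include.lowest = TRUE)

# Values per quartile bin
print(table(data3cut))
```

For this data the four bins hold 3, 2, 2 and 3 values. One caveat: if the data has many ties, two quantiles can coincide, and cut will then stop with an error because the breaks are not unique.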

I hope it's clear now.

camille
epo3
  • Many thanks, everything except the first line is clear. Why does the bin 3-5 hold only the value `4` and not `5`, and 5-7 only `5` but not `6, 7`? (I assumed the upper values are included in the bin by default and that bin 8-10 holds `9, 10`, but obviously that is not the case.) The same goes for the other bins. Could you elaborate a bit on how `cut` splits and returns a factor? Is there any loss of data in the process? Thanks again. – deepseefan Aug 24 '16 at 13:46
  • Bin 3-5 holds both `4` and `5`. When you use `data1cut[5]` it will return `4`, which is the fourth bin, i.e. bin 3-5. Please re-read @Cath's responses, she explained `cut` very clearly. I think you are confusing the actual values with the default bin labels. – epo3 Aug 24 '16 at 16:02