New R user. I'm trying to split a dataset into deciles using cut, following the process in this question. I want to add the decile values as a new column in a data frame, but when I do, the lowest value comes out as NA. This happens whether I set include.lowest=TRUE or FALSE. Does anyone know why?

It also happens with this sample set, so it's not specific to my data.

> data <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)

> decile <- cut(data, quantile(data, (0:10)/10, labels=TRUE, include.lowest=FALSE))

> df <- cbind(data, decile)

> df

      data decile
 [1,]    1     NA
 [2,]    2      1
 [3,]    3      2
 [4,]    4      2
 [5,]    5      3
 [6,]    6      3
 [7,]    7      4
 [8,]    8      4
 [9,]    9      5
[10,]   10      5
[11,]   11      6
[12,]   12      6
[13,]   13      7
[14,]   14      7
[15,]   15      8
[16,]   16      8
[17,]   17      9
[18,]   18      9
[19,]   19     10
[20,]   20     10
mikemalloy

1 Answer

There are two problems. First, you have a couple of things wrong with your cut call. I think you meant:

cut(data, quantile(data, (0:10)/10), include.lowest=FALSE)
##                                ^missing parenthesis

Also, labels should be FALSE, NULL, or a vector containing one label per interval, i.e. of length(breaks) - 1.
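
To illustrate the labels options, here is a small sketch using the sample data from the question. The label names "D1" to "D10" are made up for the example, and include.lowest=TRUE is used only so that every value gets a class.

```r
data <- 1:20
breaks <- quantile(data, (0:10)/10)   # 11 breaks define 10 intervals

# labels = FALSE returns plain integer codes instead of a factor
codes <- cut(data, breaks, labels = FALSE, include.lowest = TRUE)

# Custom labels (hypothetical names): one per interval, not one per break
named <- cut(data, breaks, labels = paste0("D", 1:10), include.lowest = TRUE)

print(codes)
print(levels(named))
```

If the label vector does not have one entry per interval, cut stops with an error rather than recycling it.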

Second, the main issue is that because include.lowest=FALSE is in effect (it is also cut's default, which applies when the argument ends up inside the quantile call) and data[1] is 1, which coincides with the first break as defined by

> quantile(data, (0:10)/10)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
 1.0  2.9  4.8  6.7  8.6 10.5 12.4 14.3 16.2 18.1 20.0

the value 1 doesn't fall into any category: cut's intervals are open on the left by default, so the first one is (1,2.9] and 1 itself lies outside it.
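
A quick way to check this yourself, as a sketch on the same sample data: as far as I can tell, quantile accepts the stray arguments through its ... and ignores them, so cut never sees them and uses its defaults.

```r
data <- 1:20
breaks <- quantile(data, (0:10)/10)

# Arguments placed inside quantile() never reach cut(); the breaks are unchanged
stray <- quantile(data, (0:10)/10, labels = TRUE, include.lowest = TRUE)
print(all.equal(breaks, stray))

# With cut()'s defaults the intervals are (a,b], so the minimum gets no class
first_default <- cut(data, breaks)[1]
print(is.na(first_default))

# Closing the first interval on the left picks the minimum up again
first_closed <- cut(data, breaks, include.lowest = TRUE)[1]
print(as.character(first_closed))
```

This is why switching include.lowest between TRUE and FALSE made no difference in the original call.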

I'm not sure what you want, but you could try one of these two alternatives, depending on which class you want 1 to be in:

> cut(data, quantile(data, (0:10)/10), include.lowest=TRUE)
 [1] [1,2.9]     [1,2.9]     (2.9,4.8]   (2.9,4.8]   (4.8,6.7]   (4.8,6.7]  
 [7] (6.7,8.6]   (6.7,8.6]   (8.6,10.5]  (8.6,10.5]  (10.5,12.4] (10.5,12.4]
[13] (12.4,14.3] (12.4,14.3] (14.3,16.2] (14.3,16.2] (16.2,18.1] (16.2,18.1]
[19] (18.1,20]   (18.1,20]  
10 Levels: [1,2.9] (2.9,4.8] (4.8,6.7] (6.7,8.6] (8.6,10.5] ... (18.1,20]
> cut(data, c(0, quantile(data, (0:10)/10)), include.lowest=FALSE)
 [1] (0,1]       (1,2.9]     (2.9,4.8]   (2.9,4.8]   (4.8,6.7]   (4.8,6.7]  
 [7] (6.7,8.6]   (6.7,8.6]   (8.6,10.5]  (8.6,10.5]  (10.5,12.4] (10.5,12.4]
[13] (12.4,14.3] (12.4,14.3] (14.3,16.2] (14.3,16.2] (16.2,18.1] (16.2,18.1]
[19] (18.1,20]   (18.1,20]  
11 Levels: (0,1] (1,2.9] (2.9,4.8] (4.8,6.7] (6.7,8.6] ... (18.1,20]
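
Since the goal in the question was a new column in a data frame, here is a minimal sketch of the first alternative applied that way. It assumes you want decile numbers 1 to 10 rather than interval labels; the column names are just for illustration.

```r
data <- 1:20
breaks <- quantile(data, (0:10)/10)

# data.frame() keeps column types; cbind() on plain vectors builds a matrix
# and reduces a factor to its integer codes
df <- data.frame(data = data)

# labels = FALSE gives the decile number directly; include.lowest = TRUE
# keeps the minimum in the first decile instead of NA
df$decile <- cut(df$data, breaks, labels = FALSE, include.lowest = TRUE)

print(head(df, 3))
```

Drop labels = FALSE if you would rather store the interval labels as a factor column.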
Gavin Simpson
  • Thanks! The problem was the incorrect parenthesis placement, I couldn't get it to work with include.lowest=TRUE or FALSE before. – mikemalloy Jul 31 '13 at 15:29