0

I have a data with primary key and ratio values like the following

2.243164164
1.429242413
2.119270714
3.013427143
1.208634972
1.208634972
1.23657632
2.212136028
2.168583297
2.151961216
1.159886063
1.234106444
1.694206176
1.401425329
5.210125578
1.215267806
1.089189869

I want to add a rank column which groups these ratios in say 3 bins. Functionality similar to the sas code:

PROC RANK DATA = TAB1 GROUPS = &NUM_BINS
Joe
  • 62,789
  • 6
  • 49
  • 67
Lakshmi
  • 1
  • 1
  • 2
    We can use `?cut` and specify `breaks=3` – akrun Oct 21 '15 at 07:49
  • What would be the expected output? –  Oct 21 '15 at 08:46
  • 2
    Yep @akrun solution is great: try `data.frame(Value=values,Bin=as.integer(cut(values,breaks=3)))` – digEmAll Oct 21 '15 at 08:49
  • I tried cut. However the number of observations in the bins are not equal. The expected output would be Bin 1 (1.089189869 1.159886063 1.208634972 1.208634972 1.215267806 1.234106444 ) Bin 2 (1.23657632 1.401425329 1.429242413 1.694206176 2.119270714 2.151961216 ) Bin3 (2.168583297 2.212136028 2.243164164 3.013427143 5.210125578 ) Basically sorted and grouped in bins with equal number of observations – Lakshmi Oct 21 '15 at 08:56

2 Answers2

1

I did the following: Convert your vector to data frame. Create variable Rank:

test2$rank<-rank(test2$test)
> test2
   test    rank
1  2.243164 15.0
2  1.429242  9.0
3  2.119271 11.0
4  3.013427 16.0
5  1.208635  3.5
6  1.208635  3.5
7  1.236576  7.0
8  2.212136 14.0
9  2.168583 13.0
10 2.151961 12.0
11 1.159886  2.0
12 1.234106  6.0
13 1.694206 10.0
14 1.401425  8.0
15 5.210126 17.0
16 1.215268  5.0
17 1.089190  1.0

Define function to convert to percentile ranks and then define pr as that percentile.

percent.rank<-function(x) trunc(rank(x)/length(x)*100)
test3<-within(test2,pr<-percent.rank(rank))

Then I created bins on the fact you wanted 3 of them.

test3$bins <- cut(test3$pr, breaks=c(0,33,66,100), labels=c("0-33","34-66","66-100"))

       test    x rank  pr   bins
1  2.243164 15.0 15.0  88 66-100
2  1.429242  9.0  9.0  52  34-66
3  2.119271 11.0 11.0  64  34-66
4  3.013427 16.0 16.0  94 66-100
5  1.208635  3.5  3.5  20   0-33
6  1.208635  3.5  3.5  20   0-33
7  1.236576  7.0  7.0  41  34-66
8  2.212136 14.0 14.0  82 66-100
9  2.168583 13.0 13.0  76 66-100
10 2.151961 12.0 12.0  70 66-100
11 1.159886  2.0  2.0  11   0-33
12 1.234106  6.0  6.0  35  34-66
13 1.694206 10.0 10.0  58  34-66
14 1.401425  8.0  8.0  47  34-66
15 5.210126 17.0 17.0 100 66-100
16 1.215268  5.0  5.0  29   0-33
17 1.089190  1.0  1.0   5   0-33

That work for you?

akaDrHouse
  • 2,190
  • 2
  • 20
  • 29
0

Almost late but given your data, we can use ntile from dplyr package to get equal sized groups:

df <- data.frame(values = c(2.243164164,
                   1.429242413,
                   2.119270714,
                   3.013427143,
                   1.208634972,
                   1.208634972,
                   1.23657632,
                   2.212136028,
                   2.168583297,
                   2.151961216,
                   1.159886063,
                   1.234106444,
                   1.694206176,
                   1.401425329,
                   5.210125578,
                   1.215267806,
                   1.089189869))
library(dplyr)

df <- df %>%
  arrange(values) %>%
  mutate(rank = ntile(values, 3))
    
    
     values rank
1  1.089190    1
2  1.159886    1
3  1.208635    1
4  1.208635    1
5  1.215268    1
6  1.234106    1
7  1.236576    2
8  1.401425    2
9  1.429242    2
10 1.694206    2
11 2.119271    2
12 2.151961    2
13 2.168583    3
14 2.212136    3
15 2.243164    3
16 3.013427    3
17 5.210126    3

Or see cut_number from ggplot2 package:

library(ggplot2)
df$rank2 <- cut_number(df$values, 3, labels = c(1:3))

     values rank rank2
1  1.089190    1     1
2  1.159886    1     1
3  1.208635    1     1
4  1.208635    1     1
5  1.215268    1     1
6  1.234106    1     1
7  1.236576    2     2
8  1.401425    2     2
9  1.429242    2     2
10 1.694206    2     2
11 2.119271    2     2
12 2.151961    2     3
13 2.168583    3     3
14 2.212136    3     3
15 2.243164    3     3
16 3.013427    3     3
17 5.210126    3     3

Because your sample consists of 17 numbers, one bin consists of 5 numbers while the others consist of 6 numbers. There are differences for row 12: ntile assigns 6 numbers to the first and second group, whereas cut_number assigns them to the first and third group.

> table(df$rank)
1 2 3 
6 6 5 

> table(df$rank2)
1 2 3 
6 5 6 

See also here: Splitting a continuous variable into equal sized groups

Apl4n1
  • 27
  • 5