4

Suppose that i have following data

ID  Category    Price   Month
1   X            2       1
1   X            2       2
1   X            2       3
1   X            2       4
2   X            3       1
2   X            3       2
2   X            3       3
2   X            3       4
3   X            1       1
3   X            1       2
3   X            1       3
3   X            1       4
4   X            10      1
4   X            10      2
4   X            10      3
4   X            10      4
5   Y            5       1
5   Y            5       2
5   Y            5       3
5   Y            5       4
6   Y            2       1
6   Y            2       2
6   Y            2       3
6   Y            2       4
7   Y            1       1
7   Y            1       2
7   Y            1       3
7   Y            1       4
8   Y            10      1
8   Y            10      2
8   Y            10      3
8   Y            10      4

There are different prices for products in a particular category, some are low priced and some high priced. I want to have a new variable "Price Level" which shows whether the product is low priced product, medium priced product or high priced product.

Levels are defined as follows. It takes the prices of all the products in a particular category and divide into 4 percentiles.

  • 1st Percentile = Low
  • 2nd and 3rd percentile = Medium
  • 4th percentile = High

So the table will look like this

ID  Category    Price   Month   Price Level
1   X            4       1        Medium
1   X            4       2        Medium
1   X            4       3        Medium
1   X            4       4        Medium
2   X            3       1        Medium
2   X            3       2        Medium
2   X            3       3        Medium
2   X            3       4        Medium
3   X            1       1        Low
3   X            1       2        Low
3   X            1       3        Low
3   X            1       4        Low
4   X            10      1        High
4   X            10      2        High
4   X            10      3        High
4   X            10      4        High
5   Y            5       1        Medium
5   Y            5       2        Medium
5   Y            5       3        Medium
5   Y            5       4        Medium
6   Y            2       1        Low
6   Y            2       2        Low
6   Y            2       3        Low
6   Y            2       4        Low
7   Y            1       1        Low
7   Y            1       2        Low
7   Y            1       3        Low
7   Y            1       4        Low
8   Y            10      1        Low
8   Y            10      2        Low
8   Y            10      3        Low
8   Y            10      4        Low
Jay khan
  • 745
  • 2
  • 9
  • 22

2 Answers2

1

You can lapply across the data.frame split by Category, and call cut and quantile on each group. data.frame and do.call(rbind, reassemble the data back into a single data.frame:

do.call(rbind, lapply(split(df, df$Category), function(x){
    data.frame(x, Price_Level =  cut(x$Price, 
                                     quantile(x$Price, probs = c(0, .25, .75, 1)), 
                                     labels = c('Low', 'Medium', 'High'), 
                                     include.lowest = TRUE))
}))

#    ID Category Price Month Price_Level
# 1   1        X     2     1      Medium
# 2   1        X     2     2      Medium
# 3   1        X     2     3      Medium
# 4   1        X     2     4      Medium
# 5   2        X     3     1      Medium
# 6   2        X     3     2      Medium
# 7   2        X     3     3      Medium
# 8   2        X     3     4      Medium
# 9   3        X     1     1         Low
# 10  3        X     1     2         Low
# 11  3        X     1     3         Low
# 12  3        X     1     4         Low
# 13  4        X    10     1        High
# 14  4        X    10     2        High
# 15  4        X    10     3        High
# 16  4        X    10     4        High
# 17  5        Y     5     1      Medium
# 18  5        Y     5     2      Medium
# 19  5        Y     5     3      Medium
# 20  5        Y     5     4      Medium
# 21  6        Y     2     1      Medium
# 22  6        Y     2     2      Medium
# 23  6        Y     2     3      Medium
# 24  6        Y     2     4      Medium
# 25  7        Y     1     1         Low
# 26  7        Y     1     2         Low
# 27  7        Y     1     3         Low
# 28  7        Y     1     4         Low
# 29  8        Y    10     1        High
# 30  8        Y    10     2        High
# 31  8        Y    10     3        High
# 32  8        Y    10     4        High

If you just want to return a single column, but don't want to worry about grouping messing up your order, you can use the equivalent

factor(ave(df$Price, df$Category, FUN = function(x){
    cut(x, 
        quantile(x, probs = c(0, .25, .75, 1)), 
        include.lowest = TRUE)
}), levels = c(1, 2, 3), labels = c('Low', 'Medium', 'High'))

A slightly less ugly version with dplyr:

library(dplyr)
df %>% group_by(Category) %>% mutate(Price_Level = cut(Price, 
                                                       quantile(Price, c(0, .25, .75, 1)), 
                                                       labels = c('Low', 'Medium', 'High'), 
                                                       include.lowest = TRUE))
alistaire
  • 42,459
  • 4
  • 77
  • 117
  • I tried to apply the code on original data which does not have equal number of rows in each category and month so i got the error " 'breaks' are not unique" so i tried to put quantile inside unique function and then i got the error "lengths of 'breaks' and 'labels' differ". How to fix it? – Jay khan Mar 20 '16 at 05:18
  • Can you edit with a subset of data that causes the issue? – alistaire Mar 20 '16 at 05:23
1

We can use data.table

library(data.table)
setDT(df)[, Price_Level := cut(Price, 
             quantile(Price, c(0, .25, .75, 1)), 
            labels = c('Low', 'Medium', 'High'),
            include.lowest = TRUE), by = Category]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I tried to apply the code on original data which does not have equal number of rows in each category and month so i got the error " 'breaks' are not unique" so i tried to put quantile inside unique function and then i got the error "lengths of 'breaks' and 'labels' differ". How to fix it? – Jay khan Mar 20 '16 at 05:18
  • @Jaykhan In the example data, it is working. – akrun Mar 20 '16 at 05:23