2

I am having a data set:

Security      %market value      return    Quintile*
1             0.07               100        3
2             0.10               88         2
3             0.08               78         1
4             0.12               59         1
5             0.20               106        4
6             0.04               94         3
7             0.05               111        5
8             0.10               83         2
9             0.06               97         3
10            0.03               90         3
11            0.15               119        5

the actual data set is having more than 5,000 rows, and I would like to use R to create 5 quintiles, each quintile is suppose to have 20% of market value. In addition, they have to be ranked in the order of magnitude of return. That is, 1st quintile should contain the 20% securities with the lowest return value, 5th quintile should contain the 20% securities with the highest return value. I would like to create the column "Quintile", among different quintiles there can be different numbers of securities but total %market value should be same. I have tries several methods and I am very new to R, so please kindly provide me some help. Thank you very much in advance!

Samuel

Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
Samuel Song
  • 2,075
  • 3
  • 18
  • 17
  • 1
    So, [***what have you tried?***](http://mattgemmell.com/2008/12/08/what-have-you-tried/). – Simon O'Hanlon May 23 '13 at 21:33
  • @SimonO101 Thank yo very much for helping me change the format. This is the first time I post a question, and really new to it. – Samuel Song May 23 '13 at 21:39
  • @SimonO101 I have tried to use a loop: for (c in 1:nrow(raw_data)) { mv_sum <- mv_sum + raw_data[c,"mv_pct"] raw_data[c,"mv_sum"] <- mv_sum if (mv_sum < 0.20) { raw_data[c,"cap_quintile"] <- 1 mv_sum <- 0 } } but this can only create 1 quintile at a time, and takes a lot of time. – Samuel Song May 23 '13 at 21:40
  • Ok, great! Post that into the question so people can see you have a method you attempted. Generally on here, when asking how to do something it's always good to show that you have already tried to do something, and even better to include a [reproducible example](http://stackoverflow.com/q/5963269/1478381) that shows your input (+1 for including this) and your expected output (which you also showed in the `Quintile` column) and how you have set about trying to do this. – Simon O'Hanlon May 23 '13 at 21:51

2 Answers2

1

You can order your data and then use findInterval (adding a small delta to use closed right sided braces):

raw_data <- raw_data[order(raw_data$return),]
raw_data$Q2 <- findInterval( cumsum(raw_data$marketvalue) , seq(0,1,length=5)+0.000001 , right = T )
raw_data
#   Security marketvalue return Quintile Q2
#4         4        0.12     59        1  1
#3         3        0.08     78        1  1
#8         8        0.10     83        2  2
#2         2        0.10     88        2  2
#10       10        0.03     90        3  3
#6         6        0.04     94        3  3
#9         9        0.06     97        3  3
#1         1        0.07    100        3  3
#5         5        0.20    106        4  4
#7         7        0.05    111        5  5
#11       11        0.15    119        5  5
Simon O'Hanlon
  • 58,647
  • 14
  • 142
  • 184
  • Thank you for your suggestion. However, I believe you have misunderstood my question a little bit. When I say 20% of securities, I am not saying 20% number of securities but 20% market value. Each quintile should be weighted by the value in column "market value" but not number of securities. In an other word, all securities in the same quintile group should have a %market value add up to exactly 20%. That's why your result is slightly different from mine. Please let me know if you still have any questions about me statement. Thank you! – Samuel Song May 23 '13 at 23:53
  • sorry for the late reply. I still have question about your code. Why are you using "df$marketvalue" instead of "raw_data$marketvalue"? where is this "df" coming from? when I was trying to use this code, there is an error of: "object of type 'closure' is not subsettable" how can I solve this? – Samuel Song May 24 '13 at 19:11
  • @user2415342 spelling mistake. Since you didn't provide data through the usual method of `dput` I made a data.frame myself and called it `df`. Just change `df` to `raw_data` (I have edited the answer to reflect this). Please let me know if this fixes your problem. Cheers! – Simon O'Hanlon May 24 '13 at 21:47
  • Sorry again for the late reply, this works perfect for my case. Thank you very much! – Samuel Song Jun 05 '13 at 22:00
1

The following works with your data.

First, sort by increasing return:

dat <- dat[order(dat$return), ]

Then, compute the cumulative market share and cut every 0.2:

dat$Quintile <- ceiling(cumsum(dat$market) / 0.2)

Finally, sort things back by Security:

dat <- dat[order(dat$Security), ]
flodel
  • 87,577
  • 21
  • 185
  • 223
  • I am not sure, but it seems like I cannot get a correct solution by using 'ceiling' command... – Samuel Song Jun 05 '13 at 22:00
  • @user2415342: Since it works on your example data, you will have to be more specific regarding *i cannot get a correct solution*. Even better would be to provide a new data sample that shows why it is not correct. Thanks. – flodel Jun 05 '13 at 22:43
  • thank you for your advise, i will be more specific for the next time. Currently I am good with the solution provided above by Simon – Samuel Song Jun 06 '13 at 23:03