How to select the 10% of highest and lowest values from a vector in R?

Question

As the title says, I would like to select the 10% highest and the 10% lowest values from a vector. How can I do that?

Can anyone help me? Thanks a lot.

Also, have a look at `?quantile`. – Ricardo Saporta Sep 30 '13 at 16:51
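A minimal sketch of the `quantile` route hinted at in the comment above. The variable names (`cuts`, `lowest`, `highest`) and the seed are illustrative assumptions, and the default quantile type is used. With ties, or with lengths that are not a multiple of 10, the number of selected values may not be exactly 10% on each side.

```r
set.seed(1)                  # arbitrary seed, only for reproducibility
v <- rnorm(100)

# Cut-offs at the 10th and 90th percentiles (default quantile type)
cuts <- quantile(v, probs = c(0.1, 0.9))

lowest  <- v[v <= cuts[1]]   # values at or below the 10th percentile
highest <- v[v >= cuts[2]]   # values at or above the 90th percentile
```

Unlike the sort-based approaches, this keeps the selected values in their original order within `v`.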

score 5 · Answer 1 · answered Sep 30 '13 at 16:34

This is an example that takes roughly 10%:

v <- rnorm(100)
sort(v)[1:(length(v)/10)]                  # lowest, in increasing order.
sort(v, decreasing=TRUE)[1:(length(v)/10)] # highest, in decreasing order.

PascalVKooten


You are right. I am a new R user, so even basic questions can seem difficult to me. Your comments are fair, and I should improve and learn more. Thanks. – Oscar-fr Sep 30 '13 at 16:40
@Oscar-fr FYI - Simple questions are fine, generally speaking. Most of us _do_ like helping new R users. What's frustrating, though, is when people ask us "How do I do X?" and provide a specification for a task, but no code that demonstrates what you've tried. In the future, make sure you _try something first_ and then share what you tried in your question. – joran Sep 30 '13 at 16:47
Here is some suggested reading before future questions: [this](http://meta.stackoverflow.com/help/how-to-ask), [this](http://meta.stackexchange.com/questions/156810/stack-overflow-question-checklist) and [this](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/5963610#5963610). Welcome to stackoverflow! – Henrik Sep 30 '13 at 16:54
Why not just sort once and then use `head` and `tail` to grab the values from both ends? – Greg Snow Sep 30 '13 at 17:32
@GregSnow Don't you still need to find the 10%? I just learned that `head` has the `n` argument, but how does that get you first **10%**? – PascalVKooten Sep 30 '13 at 17:49
Also, if the user wants to obtain both at the same time, then yes, storing and using head and tail might make more sense. However, if memory is an issue then one might do it the other way? So I guess that part of your comment is not so important. – PascalVKooten Sep 30 '13 at 17:51
@Dualinity, I was just thinking that `tail( sort(v), length(v)/10 )` would be simpler than doing a second sort in descending order and I think that `head` and `tail` are a little clearer than creating the sequence using `:`, though that preference can easily vary between people. – Greg Snow Sep 30 '13 at 17:55
I agree completely with you there. It depends on the specifics of the situation. This way just shows how `sort` can be used to sort in both directions. – PascalVKooten Sep 30 '13 at 17:59
@GregSnow, I liked your head/tail approach (and Dualinity's, +1). Not that it matters here (I assume) but I just noted that `head` and `tail` seem to handle a non-integer `n` differently. `x <- sample(1:13)`; `head(sort(x), length(x)/10)`; `tail(sort(x), length(x)/10)`. – Henrik Sep 30 '13 at 18:12
@Henrik, yes, internally `head` uses `seq_len` and `tail` uses `seq.int` which round differently. If it is an issue then use `round`, `floor`, or `ceiling` before passing to `head`/`tail`. – Greg Snow Sep 30 '13 at 18:20
@GregSnow, thanks a lot for your explanation! No issue so far, I was just curious since it was the first time I tried a non-integer `n`. I will keep your suggestions in mind though. Cheers. – Henrik Sep 30 '13 at 18:23
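The sort-once idea from the comments, with the rounding made explicit, might look like the sketch below. Using `floor` is an assumption here (`ceiling` or `round` are equally valid choices, depending on what "10%" should mean for short vectors), and the names `n`, `s`, `lowest`, and `highest` are illustrative.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- sample(1:13)                # 13 values, so 10% is not a whole number

n <- floor(length(x) / 10)       # explicit rounding; here n is 1
s <- sort(x)                     # sort once, reuse for both ends

lowest  <- head(s, n)            # n smallest values, increasing order
highest <- tail(s, n)            # n largest values, increasing order
```

Passing an already rounded `n` means `head` and `tail` always return the same number of elements.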

score 3 · Answer 2 · answered Sep 30 '13 at 16:38

This will return a vector containing the bottom and top 10% of x:

> set.seed(123)
> x<-rnorm(100)
> x[{q<-rank(x)/length(x);q<0.1 | q>=0.9}]
 [1]  1.558708  1.715065 -1.265061  1.786913 -1.966617 -1.686693 -1.138137
 [8]  1.253815 -1.265396  2.168956 -1.123109  1.368602  1.516471 -1.548753
[15]  2.050085 -2.309169 -1.220718  1.360652  2.187333  1.532611
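One detail worth checking in the expression above: because `rank(x)/length(x)` runs from 0.01 to 1 for 100 distinct values, `q < 0.1` matches ranks 1 to 9 while `q >= 0.9` matches ranks 90 to 100, so the two ends are not the same size. The sketch below counts both ends and shows an assumed variant that compares integer ranks instead, giving `k` values on each side. The names `r`, `k`, and `sel` are illustrative.

```r
set.seed(123)
x <- rnorm(100)

q <- rank(x) / length(x)         # relative rank in (0, 1]
n_low  <- sum(q < 0.1)           # number selected at the low end
n_high <- sum(q >= 0.9)          # number selected at the high end

# Assumed variant: compare ranks directly so both ends have k values
r <- rank(x)
n <- length(x)
k <- floor(n / 10)
sel <- x[r <= k | r > n - k]     # k lowest and k highest, in original order
```

With ties in the data, `rank` returns averaged ranks by default, so the counts can still differ slightly from `k`.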

score 1 · Answer 3 · answered Sep 30 '13 at 17:30

Note that sorting can be slow. For small vectors you won't notice, but for very large vectors a full sort is expensive, and you don't need one to find the extremes.

See the `partial` argument on the help page for `sort` and `sort.int`: a partial sort can still give you the top and bottom 10% without a full sort. The `quantile` function uses partial sorting internally, so it should be faster than a full sort in some cases, but doing the partial sort yourself avoids some of `quantile`'s overhead and can be a bit faster still.
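A hedged sketch of the partial-sort idea: `sort(v, partial = ...)` places the elements at the requested positions where a full sort would put them, with everything smaller to their left and everything larger to their right. The vector size and the names `k`, `p`, `lowest`, and `highest` are assumptions for illustration; no timing claims are made here.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
v <- rnorm(1e5)
n <- length(v)
k <- floor(n / 10)               # how many values to take from each end

# Only positions k and n - k + 1 are guaranteed to hold their sorted values
p <- sort(v, partial = c(k, n - k + 1))

lowest  <- p[seq_len(k)]         # the k smallest values, not necessarily ordered
highest <- p[(n - k + 1):n]      # the k largest values, not necessarily ordered
```

If the selected values are needed in order, sorting just `lowest` and `highest` afterwards touches only 20% of the data.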
