Highest and lowest 100 observations for sorted multiple columns in R

Question

I came up with a more optimal semi-solution. I have sorted my dataframe by Sector and Volume.

df <- structure(list(Customer = structure(1:17, .Label = c("A", "B", 
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", 
"P", "Q"), class = "factor"), Sector = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L), .Label = c("Aviation", 
"Biotech", "Construction"), class = "factor"), Volume = c(-5000L, 
-3000L, 4000L, 6000L, 7000L, 9000L, -4000L, -1500L, 2000L, 3000L, 
5000L, 6000L, -7000L, -4000L, 5000L, 7000L, 8000L)), 
class = "data.frame", row.names = c(NA,-17L))

EDITED:

## > df
##   Customer  Sector      Volume
##     A      Aviation     - 5000
##     B      Aviation     - 3000
##     C      Aviation       4000
##     D      Aviation       6000
##     E      Aviation       7000
##     F      Aviation       9000
##     G      Biotech      - 4000
##     H      Biotech      - 1500
##     I      Biotech        2000
##     J      Biotech        3000
##     K      Biotech        5000
##     L      Biotech        6000
##     M      Construction - 7000
##     N      Construction - 4000
##     O      Construction   5000
##     P      Construction   7000
##     Q      Construction   8000

Let's say I would like to leave the highest and lowest 2 customers per sector. So, my final table should look like this:

## > df
##   Customer  Sector      Volume
##     A      Aviation     - 5000
##     B      Aviation     - 3000
##     E      Aviation       7000
##     F      Aviation       9000
##     G      Biotech      - 4000
##     H      Biotech      - 1500
##     K      Biotech        5000
##     L      Biotech        6000
##     M      Construction - 7000
##     N      Construction - 4000
##     P      Construction   7000
##     Q      Construction   8000

The only difference is that I would like to see highest/lowest 100 customers per sector in my case instead of just 2.

I'm afraid I don't understand your question enough. What do you mean by Columns A to C have been ordered? Could you please show what your desired output would be based on the example you gave? — DJJ, Apr 07 '20 at 10:38
Thank you for your response ! Unfortunately, I am not sure I can paste a table with that many rows. The idea is that columns A, B, C are sorted in ascending order and contain many values. I would like to cut the rest and leave only the 100 highest and lowest for each column. In the end I expect to have a table where column A is populated with 200 observations (out of all), column B populated with 200 observations, etc. I am not sure if now makes more sense. — krasi, Apr 07 '20 at 11:39
Let me put it that way.Column A is the aviation sector, column B is the Biotech one and column C is Construction. I have 500 customers for each sector, 1500 in total. Each of the customers could belong to only one of them, this way the others are missing. In my current dataset all customers which belong to the Aviation sector are sorted smallest to largest (all other rows are NAs) until row 500. Then all customers in the Biotech sector are sorted from largest to smallest until row 1000 (columns A and B are missing). The final 500 rows are the customers from the Construction (A and B missing). — krasi, Apr 07 '20 at 11:59
So in the end I would like to have 200 rows for each sector (top 100 smallest/largest for Aviation first, top 100 smallest/largest for Biotech next and the same for construction). 600 rows in total. Hope this helps. — krasi, Apr 07 '20 at 12:01
You don't have to output all the rows just the first 10 would be enough. If your data are sensitive you can try a dummy table just enough to illustrate your issue. I'll suggest the following [guideline](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). What would be your desired output from your example? — DJJ, Apr 07 '20 at 12:08

Ronak Shah · Answer 1 · 2020-04-07T13:03:18.507

Since each column is sorted you could remove NA values with na.omit and use head and tail to get top and bottom 100 values.

sapply(df[-1], function(x) {x1 <- na.omit(x);c(head(x1, 100), tail(x1, 100))})

Or similarly using apply with MARGIN = 2

apply(df[-1], 2, function(x) {x1 <- na.omit(x);c(head(x1, 100), tail(x1, 100))})

We can also create index to subset :

sapply(df[-1], function(x) 
        {x1 <- na.omit(x);x1[c(1:100,(length(x1) - 100):length(x1))]})

EDIT

For the updated data we can use slice from dplyr.

library(dplyr)
df %>% group_by(Sector) %>% slice(c(1:2, (n() -1):n()))


#  Customer Sector       Volume
#   <fct>    <fct>         <int>
# 1 A        Aviation      -5000
# 2 B        Aviation      -3000
# 3 E        Aviation       7000
# 4 F        Aviation       9000
# 5 G        Biotech       -4000
# 6 H        Biotech       -1500
# 7 K        Biotech        5000
# 8 L        Biotech        6000
# 9 M        Construction  -7000
#10 N        Construction  -4000
#11 P        Construction   7000
#12 Q        Construction   8000

Or another way using top_n.

bind_rows(df %>% group_by(Sector) %>% top_n(2, Volume),
          df %>% group_by(Sector) %>% top_n(-2, Volume)) %>%
arrange(Sector)

score 1 · Answer 2 · answered Apr 07 '20 at 13:21

Using the library(data.table) you achieve your desired output using the following:

library(data.table)
# convert the data.frame into a data.table
setDT(df) 
# sort the data.table by Volume
setkey(df,Volume)
# rbind the  smallest 2 volumes by sector and with the highest 
# 2 volumes by sector
rbind(df[,tail(.SD,2),Sector],
      df[,head(.SD,2),Sector])[order(Customer,Sector)]


##           Sector Customer Volume
##  1:     Aviation        A  -5000
##  2:     Aviation        B  -3000
##  3:     Aviation        E   7000
##  4:     Aviation        F   9000
##  5:      Biotech        G  -4000
##  6:      Biotech        H  -1500
##  7:      Biotech        K   5000
##  8:      Biotech        L   6000
##  9: Construction        M  -7000
## 10: Construction        N  -4000
## 11: Construction        P   7000
## 12: Construction        Q   8000

Highest and lowest 100 observations for sorted multiple columns in R

2 Answers2