Take top n records regardless of equal numbers

Question

In my dataset, I have counted the number of trips that started from bicycles stations per hour (0-23), for about 200 stations.

> head(test)
  Start.station.number hour number
1                31000    0     16
2                31000    1      1
3                31000    2      7
4                31000    3      1
5                31000    4      2
6                31000    5     12

My goal is to get the top ten stations per hour, however, it is likely that many some high popularity stations will have an equal count of trips per hour. For example, if station B, station A, and station C all have the same number of trips and are tied for 10th place at 7 AM, then one of them should be chosen randomly, it doesn't really matter which one is chosen. In R, how would one take the top ten stations by hour, regardless of the some stations having equal counts (this would be a similar function to head(n=10) per hour) ?

Below is a sample of the data

> dput(test)
structure(list(Start.station.number = c(31000L, 31000L, 31000L, 
31000L, 31000L, 31000L, 31000L, 31000L, 31000L, 31000L, 31000L, 
31000L, 31000L, 31000L, 31000L, 31000L, 31000L, 31000L, 31000L, 
31000L, 31000L, 31000L, 31000L, 31000L, 31001L, 31001L, 31001L, 
31001L, 31001L, 31001L, 31001L, 31001L, 31001L, 31001L, 31001L, 
31001L, 31001L, 31001L, 31001L, 31001L, 31001L, 31001L, 31001L, 
31001L, 31001L, 31001L, 31001L, 31002L, 31002L, 31002L), hour = c(0L, 
1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
15L, 16L, 17L, 18L, 19L, 20L, 21L, 22L, 23L, 0L, 2L, 3L, 4L, 
5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 15L, 16L, 17L, 18L, 
19L, 20L, 21L, 22L, 23L, 0L, 1L, 2L), number = c(16L, 1L, 7L, 
1L, 2L, 12L, 27L, 33L, 36L, 41L, 50L, 36L, 39L, 34L, 22L, 38L, 
40L, 27L, 37L, 31L, 16L, 15L, 16L, 8L, 3L, 3L, 1L, 1L, 2L, 15L, 
30L, 74L, 47L, 49L, 40L, 43L, 54L, 51L, 56L, 48L, 99L, 75L, 48L, 
28L, 24L, 14L, 3L, 16L, 18L, 3L)), .Names = c("Start.station.number", 
"hour", "number"), row.names = c(NA, 50L), class = "data.frame")

Order the data set, split it by hour, take the 1st 10 elements? E.g. `test <- test[order(test$number, decreasing=T),];sapply(split(test$Start.station.number, test$hour), head, 10)`? — lukeA, Nov 30 '17 at 22:49

score 2 · Accepted Answer · answered Nov 30 '17 at 23:59

Yet another example using dplyr, this time with slice. I've also added a random variable to ensure the ties are randomly ordered

library(dplyr)
set.seed(1)
df <- tibble(Start.station.number = sample(1:20, 1000, TRUE), 
             hour = sample(0:23, 1000, TRUE), 
             number = sample(1:20, 1000, TRUE),
             random = runif(1000))

df %>% 
  group_by(hour) %>% 
  arrange(-number, random, .by_group = TRUE) %>% 
  slice(1:10)
#> # A tibble: 240 x 4
#> # Groups:   hour [24]
#>    Start.station.number  hour number     random
#>                   <int> <int>  <int>      <dbl>
#>  1                    9     0     20 0.22558247
#>  2                    1     0     19 0.19954013
#>  3                    3     0     19 0.71950937
#>  4                   19     0     19 0.81171457
#>  5                   15     0     19 0.91195428
#>  6                   17     0     18 0.01283631
#>  7                    8     0     18 0.95778679
#>  8                   14     0     17 0.06196987
#>  9                   10     0     17 0.18180938
#> 10                   19     0     17 0.23981459
#> # ... with 230 more rows

score 0 · Answer 2 · answered Nov 30 '17 at 22:56

Here's an example using dplyr that returns 10 exactly:

library(dplyr)
 set.seed(1)

 df <- tibble(Start.station.number = sample(1:20, 1000, T), 
              hour = sample(1:24, 1000, T), 
              number = sample(1:20, 1000, T))

  df %>% group_by(hour) %>% arrange(hour, -number) %>% mutate(rank = row_number()) %>% filter(rank < 11) %>% select(-rank)

If you want ties you can use the convenience wrapper top_n:

 df %>% group_by(hour) %>% top_n(10, number) %>% arrange(hour, -number)

Take top n records regardless of equal numbers

2 Answers2