2

after grouping by q, would then like to extract two max value from two different w e columns respectively

input data:

q <- c(503,503,503,503,503,503,503,503,503,503,503,503,503,510,510,510,510,510,510,510,510,510,510,510,510,525,526,526)
w <- c(56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56,56)
e <- c(26,26,26,26,26,27,28,28,28,28,28,28,28,28,28,28,28,28,28,28,29,30,30,30,30,33,33,33)
r <- data.frame(q,w,e, stringsAsFactors = FALSE)

Code:

r %>% group_by(q) %>% slice(which.max(w & e))

My Output:

  q     w     e
 <dbl> <dbl> <dbl>
1  503.   56.   26.
2  510.   56.   28.
3  525.   56.   33.
4  526.   56.   33.

Expected Output:

    q   w  e
1  503 56 28
2  510 56 30
3  525 56 33
4  526 56 33

would prefer to use %>% and slice command as the above code instead of finding max separately q$w q$e and then merging by q (would like to avoid merge as my actual data is large object.size ~2GB)

beavis11111
  • 576
  • 1
  • 7
  • 19

4 Answers4

5

This is fast data.table solution and will scale well to your 2GB dataset.

library(data.table)
dt <- data.table(r)
dt[, lapply(.SD, max, na.rm=TRUE), by=q ]

Result

    q  w  e
1: 503 56 28
2: 510 56 30
3: 525 56 33
4: 526 56 33

Benchmarking

microbenchmark(data.table = dt[, lapply(.SD, max, na.rm=TRUE), by=q ],
               dplyr1 = r %>% group_by(q) %>% summarise_all(max),
               base = do.call(rbind, by(r, r$q, function(x)
               data.frame(q = unique(x$q), w = max(x$w), e = max(x$e)))), times = 50
)

Result

Unit: microseconds
       expr      min       lq     mean   median       uq       max neval
 data.table  810.240 1060.267 1447.979 1192.107 1332.054 14260.901    50
     dplyr1 1562.027 1686.613 1857.382 1759.574 1869.226  3617.279    50
       base 1925.973 2088.107 2448.162 2226.986 2485.760  7395.837    50

Clearly data.table is fastest.

Rana Usman
  • 1,031
  • 7
  • 21
4

You don't need to use slice, just summarise_all

r %>% group_by(q) %>% summarise_all(max)

# A tibble: 4 x 3
#      q     w     e
#    <dbl> <dbl> <dbl>
# 1   503  56.0  28.0
# 2   510  56.0  30.0
# 3   525  56.0  33.0
# 4   526  56.0  33.0
Relasta
  • 1,066
  • 8
  • 8
2

The following dplyr code does what you need and also avoid using merge():

r %>% 
group_by(q) %>% 
summarize(w=max(w), e=max(e))

Returns:

      q     w     e
  <dbl> <dbl> <dbl>
1  503.   56.   28.
2  510.   56.   30.
3  525.   56.   33.
4  526.   56.   33.
onlyphantom
  • 8,606
  • 4
  • 44
  • 58
2

I know you asked for a tidyverse solution, but as an alternative, here is a solution in base R using by:

do.call(rbind, by(r, r$q, function(x)
    data.frame(q = unique(x$q), w = max(x$w), e = max(x$e))))
#      q  w  e
#503 503 56 28
#510 510 56 30
#525 525 56 33
#526 526 56 33
Maurits Evers
  • 49,617
  • 4
  • 47
  • 68