Summarize with mathematical conditions in dplyr

Question

Building on this question: Summarize with conditions in dplyr I would like to use dplyr to summarize a column based on a mathematical condition (not string matching as in the linked post). I need to find the maximum measurement when the ratio of measurement/time is the highest, while creating a new column ratio. I'd also like to carry through the entire row, which I'm unsure how to do with dplyr's summarize function.

Example Data Frame

print(df)

   sample     type time measurement
1       a bacteria   24     0.57561
2       a bacteria   44     1.67236
3       a bacteria   67     4.17100
4       a bacteria   88    11.51661
5       b bacteria   24     0.53269
6       b bacteria   44     1.24942
7       b bacteria   67     5.72147
8       b bacteria   88    11.04017
9       c bacteria    0     0.00000
10      c bacteria   24     0.47418
11      c bacteria   39     1.06286
12      c bacteria   64     3.59649
13      c bacteria   78     7.05190
14      c bacteria  108     7.27060

Desired Output

  sample     type time measurement      ratio
1      a bacteria   88    11.51661 0.13087057
2      b bacteria   88    11.04017 0.12545648
3      c bacteria   78     7.05190 0.09040897

Failed Attempt

This only returns the two columns as defined by the group_by and summarize function, would like to have the entire row information carry through:

library(dplyr)
df %>% 
    group_by(sample) %>%
    summarize(ratio = max(measurement/time, na.rm = TRUE))

  sample  ratio
  <fct>   <dbl>
1 a      0.131 
2 b      0.125 
3 c      0.0904

Reproducible Data

structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), 
    type = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), .Label = "bacteria", class = "factor"), 
    time = c(24, 44, 67, 88, 24, 44, 67, 88, 0, 24, 39, 64, 78, 
    108), measurement = c(0.57561, 1.67236, 4.171, 11.51661, 
    0.53269, 1.24942, 5.72147, 11.04017, 0, 0.47418, 1.06286, 
    3.59649, 7.0519, 7.2706)), class = "data.frame", row.names = c(NA, 
-14L))

Does this answer your question? [How to select the rows with maximum values in each group with dplyr?](https://stackoverflow.com/questions/24237399/how-to-select-the-rows-with-maximum-values-in-each-group-with-dplyr) — camille, Dec 05 '19 at 16:29
I don't think it does exactly. It's similar but I have a compounded conditional where the answer there just looks to find the maximum value. — TheSciGuy, Dec 05 '19 at 16:47
I'm not sure what you're trying to do exactly, since in your example you're taking the maximum of the ratio. Is that not the condition you want? — camille, Dec 05 '19 at 16:59
It's very similar but I needed to involve the use of multiple columns (not just a max of a single value in a column), minor difference. I can see why it would be marked a duplicate. However, I'm an experienced SO/Googler could not find the answer so others may struggle too. I provided a link that will be helpful for future SO users in case they follow my search patterns — TheSciGuy, Dec 05 '19 at 17:08

score 3 · Accepted Answer · answered Dec 05 '19 at 16:23

3

df %>%
  mutate(ratio = measurement/time) %>%
  group_by(sample) %>%
  filter(ratio == max(ratio, na.rm=TRUE))

answered Dec 05 '19 at 16:23

johnckane

645
8
18

Pedro Cavalcante · Answer 2 · 2019-12-05T16:31:54.250

3

This should do the trick.

df %>%
   group_by(sample) %>%
   mutate(ratio = measurement/time) %>%
   filter(ratio == max(ratio))

edited Dec 05 '19 at 16:31

answered Dec 05 '19 at 16:23

Pedro Cavalcante

414
4
14

2

Why are you calling `summarise`? Why not just drop the column you don't want to keep? – camille Dec 05 '19 at 16:30
Well pointed. In fact there's no need to drop columns or summarise anything. By construction there'll be only a single row per sample. – Pedro Cavalcante Dec 05 '19 at 16:32
This drops my "c" sample. – TheSciGuy Dec 05 '19 at 17:09
Do you have any missing data? If so, pass `max()` the `na.rm = TRUE` argument – Pedro Cavalcante Dec 05 '19 at 20:23

akrun · Answer 3 · 2019-12-05T16:30:29.770

2

An option would be to filter 'measurement' based on the max position of measurement/time and use that to compare (==) with the 'measurement' values after grouping by 'sample'

library(dplyr)
df %>%
   group_by(sample) %>% 
   mutate(ratio = measurement/time) %>%
   filter(measurement == measurement[which.max(ratio)])

edited Dec 05 '19 at 16:30

answered Dec 05 '19 at 16:25

akrun

874,273
37
540
662

Summarize with mathematical conditions in dplyr

3 Answers3

Linked