1

Building on this question: Summarize with conditions in dplyr I would like to use dplyr to summarize a column based on a mathematical condition (not string matching as in the linked post). I need to find the maximum measurement when the ratio of measurement/time is the highest, while creating a new column ratio. I'd also like to carry through the entire row, which I'm unsure how to do with dplyr's summarize function.


Example Data Frame

print(df)

   sample     type time measurement
1       a bacteria   24     0.57561
2       a bacteria   44     1.67236
3       a bacteria   67     4.17100
4       a bacteria   88    11.51661
5       b bacteria   24     0.53269
6       b bacteria   44     1.24942
7       b bacteria   67     5.72147
8       b bacteria   88    11.04017
9       c bacteria    0     0.00000
10      c bacteria   24     0.47418
11      c bacteria   39     1.06286
12      c bacteria   64     3.59649
13      c bacteria   78     7.05190
14      c bacteria  108     7.27060

Desired Output

  sample     type time measurement      ratio
1      a bacteria   88    11.51661 0.13087057
2      b bacteria   88    11.04017 0.12545648
3      c bacteria   78     7.05190 0.09040897

Failed Attempt

This only returns the two columns as defined by the group_by and summarize function, would like to have the entire row information carry through:

library(dplyr)
df %>% 
    group_by(sample) %>%
    summarize(ratio = max(measurement/time, na.rm = TRUE))

  sample  ratio
  <fct>   <dbl>
1 a      0.131 
2 b      0.125 
3 c      0.0904

Reproducible Data

structure(list(sample = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", "b", "c"), class = "factor"), 
    type = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), .Label = "bacteria", class = "factor"), 
    time = c(24, 44, 67, 88, 24, 44, 67, 88, 0, 24, 39, 64, 78, 
    108), measurement = c(0.57561, 1.67236, 4.171, 11.51661, 
    0.53269, 1.24942, 5.72147, 11.04017, 0, 0.47418, 1.06286, 
    3.59649, 7.0519, 7.2706)), class = "data.frame", row.names = c(NA, 
-14L))
TheSciGuy
  • 1,154
  • 11
  • 22
  • 1
    Does this answer your question? [How to select the rows with maximum values in each group with dplyr?](https://stackoverflow.com/questions/24237399/how-to-select-the-rows-with-maximum-values-in-each-group-with-dplyr) – camille Dec 05 '19 at 16:29
  • I don't think it does exactly. It's similar but I have a compounded conditional where the answer there just looks to find the maximum value. – TheSciGuy Dec 05 '19 at 16:47
  • I'm not sure what you're trying to do exactly, since in your example you're taking the maximum of the ratio. Is that not the condition you want? – camille Dec 05 '19 at 16:59
  • It's very similar but I needed to involve the use of multiple columns (not just a max of a single value in a column), minor difference. I can see why it would be marked a duplicate. However, I'm an experienced SO/Googler could not find the answer so others may struggle too. I provided a link that will be helpful for future SO users in case they follow my search patterns – TheSciGuy Dec 05 '19 at 17:08

3 Answers3

3
df %>%
  mutate(ratio = measurement/time) %>%
  group_by(sample) %>%
  filter(ratio == max(ratio, na.rm=TRUE))
johnckane
  • 645
  • 8
  • 18
3

This should do the trick.

df %>%
   group_by(sample) %>%
   mutate(ratio = measurement/time) %>%
   filter(ratio == max(ratio)) 
Pedro Cavalcante
  • 414
  • 4
  • 14
2

An option would be to filter 'measurement' based on the max position of measurement/time and use that to compare (==) with the 'measurement' values after grouping by 'sample'

library(dplyr)
df %>%
   group_by(sample) %>% 
   mutate(ratio = measurement/time) %>%
   filter(measurement == measurement[which.max(ratio)])
akrun
  • 874,273
  • 37
  • 540
  • 662