I tried to write a function that returns the x largest MOLECULES, ranked in descending order by how many unique PATIENT_ID values each one has, counting only rows from a given date onward.

data <- data.frame(PATIENT_ID = c(1,1,2,2), dateM = c(ymd("2020-01-05","2020-01-06","2020-05-06","2019-12-15")), MOLECULES = c("mol1", "mol1", "mol1", "mol2"))


topx <- function(data, datefrom, var ,  x = 5){
  data %>%
  subset(dateM >= datefrom) %>%
  group_by(var) %>%
  summarize(pat = length(unique(PATIENT_ID))) %>%
  arrange(-pat) %>% 
  head(x) %>% 
  select(1)
}

topx(data = data, datefrom = "2016-04", var = MOLECULES, x = 2) 

The wanted result in this case would be:

c("mol1","mol2")

However, it treats `var` as a literal column name instead of substituting MOLECULES, and fails with this error:

 Error: Must group by variables found in `.data`.
* Column `var` is not found.
Konrad Rudolph
Jirka Čep
    Heads up, there’s also the function [`slice_max`](https://dplyr.tidyverse.org/reference/slice.html) in ‘dplyr’, which does something very similar; that said, I don’t think using it here would help. Apart from this, I recommend not mixing ‘dplyr’ functions with the base R equivalents. That is, use `filter` instead of `subset`. `filter` is more robust, provides better error messages when you do something wrong, and also works with interpolated variables via `{{…}}`. `subset` would *not* work with it. In principle the same is true with `head` vs `slice_head`, but the argument is less strong here. – Konrad Rudolph Jan 14 '21 at 13:23

2 Answers

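Cool function. dplyr verbs use tidy evaluation, so there are special rules for programming with them: `group_by(var)` looks for a column literally named `var`. See more here. Specifically, you need the `{{ }}` operator, which passes the column supplied by the caller through to `group_by()`.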


library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following objects are masked from 'package:base':
#> 
#>     date, intersect, setdiff, union

data <- data.frame(PATIENT_ID = c(1,1,2,2), dateM = c(ymd("2020-01-05","2020-01-06","2020-05-06","2019-12-15")), MOLECULES = c("mol1", "mol1", "mol1", "mol2"))

topx <- function(data, datefrom, var ,  x = 5){
  data %>%
    subset(dateM >= datefrom) %>%
    group_by({{var}}) %>%
    summarize(pat = length(unique(PATIENT_ID))) %>%
    arrange(-pat) %>% 
    head(x) %>% 
    select(1)
}

topx(data = data, datefrom = "2016-04-01", var = MOLECULES, x = 2) 
#> `summarise()` ungrouping output (override with `.groups` argument)
#> # A tibble: 2 x 1
#>   MOLECULES
#>   <chr>    
#> 1 mol1     
#> 2 mol2

Created on 2021-01-14 by the reprex package (v0.3.0)
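
Following the suggestion in the comment on the question, here is a sketch of the same function written with dplyr verbs only (`filter()` in place of `subset()`, `n_distinct()` for the unique count) and ending in `pull()`, so that it returns a character vector like the `c("mol1","mol2")` that was asked for. The name `topx_vec` is made up for this example, and wrapping `datefrom` in `as.Date()` is an assumption about how the start date will be passed in.

```r
library(dplyr)

data <- data.frame(
  PATIENT_ID = c(1, 1, 2, 2),
  dateM = as.Date(c("2020-01-05", "2020-01-06", "2020-05-06", "2019-12-15")),
  MOLECULES = c("mol1", "mol1", "mol1", "mol2"),
  stringsAsFactors = FALSE
)

# topx_vec is a hypothetical name, not part of the original answer
topx_vec <- function(data, datefrom, var, x = 5) {
  data %>%
    filter(dateM >= as.Date(datefrom)) %>%   # keep rows on or after the start date
    group_by({{ var }}) %>%                   # embrace the column passed by the caller
    summarise(pat = n_distinct(PATIENT_ID), .groups = "drop") %>%
    arrange(desc(pat)) %>%                    # most patients first
    head(x) %>%
    pull({{ var }})                           # return a plain vector instead of a tibble
}

top <- topx_vec(data, datefrom = "2016-04-01", var = MOLECULES, x = 2)
top
#> [1] "mol1" "mol2"
```

Note that `datefrom` needs to be a full date such as "2016-04-01"; the "2016-04" used in the question cannot be converted to a date.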

Magnus Nordmo
    Follow-up question: when I use this function as intended, i.e. to summarize based on whether values of MOLECULES are in the top x, it prints "summarise() regrouping output by 'dateM' (override with .groups argument)" followed by "summarise() ungrouping output (override with .groups argument)", with the second message repeated about 48 times. This persists even when I add `as.factor` or `as.character` to the end of the function. – Jirka Čep Jan 14 '21 at 15:52
    See answer here: https://stackoverflow.com/questions/62140483/how-to-interpret-dplyr-message-summarise-regrouping-output-by-x-override – Magnus Nordmo Jan 14 '21 at 20:54
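
The text quoted in the follow-up is an informational message rather than an error: `summarise()` is reporting how it left the grouping of its result. A minimal sketch of silencing it by stating the grouping explicitly with `.groups` is below. It reuses the example data from the question, and grouping by both `dateM` and `MOLECULES` is only an assumption about what the real pipeline does.

```r
library(dplyr)

data <- data.frame(
  PATIENT_ID = c(1, 1, 2, 2),
  dateM = as.Date(c("2020-01-05", "2020-01-06", "2020-05-06", "2019-12-15")),
  MOLECULES = c("mol1", "mol1", "mol1", "mol2"),
  stringsAsFactors = FALSE
)

# With two grouping columns, summarise() would normally announce that it
# regroups the output by the first one. Passing .groups makes the choice
# explicit, so no message is printed.
counts <- data %>%
  group_by(dateM, MOLECULES) %>%
  summarise(pat = n_distinct(PATIENT_ID), .groups = "drop")

is_grouped_df(counts)
#> [1] FALSE
```

If the message comes from a function that is called many times, adding `.groups = "drop"` inside that function removes every repetition at once.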

I believe this is a quasiquotation issue. `!!` unquotes a single expression, so capture the argument with `enquo()` and then unquote it inside `group_by()`. For more information see https://adv-r.hadley.nz/quasiquotation.html

Try:

topx <- function(data, datefrom, var ,  x = 5){
  var <- enquo(var)
  data %>%
  subset(dateM >= datefrom) %>%
  group_by(!!var) %>%
  summarize(pat = length(unique(PATIENT_ID))) %>%
  arrange(-pat) %>% 
  head(x) %>% 
  select(1)
}
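
The `enquo()` plus `!!` pair used here and the `{{ }}` operator from the other answer are two spellings of the same idea. The sketch below compares them on the grouping step alone; the helper names `count_embrace` and `count_enquo` are hypothetical and exist only for this comparison.

```r
library(dplyr)

data <- data.frame(
  PATIENT_ID = c(1, 1, 2, 2),
  dateM = as.Date(c("2020-01-05", "2020-01-06", "2020-05-06", "2019-12-15")),
  MOLECULES = c("mol1", "mol1", "mol1", "mol2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: embrace the argument directly
count_embrace <- function(data, var) {
  data %>%
    group_by({{ var }}) %>%
    summarise(pat = n_distinct(PATIENT_ID), .groups = "drop")
}

# Hypothetical helper: capture the argument first, then unquote it
count_enquo <- function(data, var) {
  var <- enquo(var)
  data %>%
    group_by(!!var) %>%
    summarise(pat = n_distinct(PATIENT_ID), .groups = "drop")
}

# TRUE when both styles produce the same table
same <- identical(count_embrace(data, MOLECULES), count_enquo(data, MOLECULES))
```

The embraced form is shorter, while the two-step form can be handy when the captured column needs to be reused or inspected before it is unquoted.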
latlio