
The Problem

I am trying to write a function that combines dplyr syntax with [] indexing, but I am using quosures incorrectly. I suspect the problem stems from my shaky understanding of quosures and tidyeval, and I am hoping someone can explain why my function isn't working.

Background

I found this code really useful and wanted to turn it into a function with which I could vary the arguments without using strings. I was able to get the function to this point, using the Programming with dplyr Vignette. (note: I changed the original code to meet my needs)

library(dplyr)    

persistence <- function(df, period, ...){
  period <- enquo(period)
  group_var <- quos(...)

  df %>% 
    group_by(!!! group_var, !! period) %>%
    summarise(persistence_rate = length(base::intersect(id, df$id[df$rank==(rank+1)]))/n_distinct(id))
}

With the data provided below, this function gives me my desired output:

persistence(data, period)

    # A tibble: 5 x 2
      period persistence_rate
      <chr>             <dbl>
    1 a                 0.500
    2 b                 1.00 
    3 c                 0.667
    4 d                 0.667
    5 e                 0. 

Unfortunately, when trying to vary the id and rank columns I was not sure how to incorporate the quosures.

What I've Tried

Using this data:

   data <- structure(list(id = c("A", "B", "C", "D", "A", "C", "A", "B", "C", "A", "D", "C", "A", "B", "C"),
                   period = c("a", "a", "a", "a", "b", "b", "c", "c", "c", "d", "d", "d", "e", "e", "e"),
                   rank = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5),
                   group = c("g1", "g2", "g1", "g2", "g1", "g1", "g1", "g2", "g1", "g1", "g2", "g1", "g1", "g2", "g1")),
                   .Names = c("id", "period", "rank", "group"),
                   row.names = c(NA, -15L),
                   class = c("tbl_df", "tbl", "data.frame"))

I ended up with this function:

persistence_new <- function(df, id, period, rank, ...){

  period <- enquo(period)
  id <- enquo(id)
  rank <- enquo(rank)
  group_var <- quos(...)

  df %>% 
    group_by(UQS(group_var), UQ(period)) %>%
    summarise(persistence_rate = length(base::intersect(UQ(id), UQ(id)[UQ(rank) == (UQ(rank) + 1)]))/n_distinct(UQ(id)))

}

Which gives me this result:

persistence_new(data, id, period, rank)

    # A tibble: 5 x 2
      period persistence_rate
      <chr>             <dbl>
    1 a                    0.
    2 b                    0.
    3 c                    0.
    4 d                    0.
    5 e                    0.

It took me a long time to get it to this point. As I was trying different things, it would often spit out an error. Now, it is running, but not giving me the results I want.

I tried essentially every combination of (), UQ, [], and [[]] that I could think of.

Thanks

I am hoping to learn more about tidyeval so that I don't have such a difficult time with this in the future. With that being said, and given that the problem is because of a lack of understanding, I would appreciate any perspectives on why my current function doesn't work. Any insight to make tidyeval more intuitive would be great.

Alternatively, feel free to point me to a specific section of the Programming with dplyr Vignette. I've worked through the entire thing twice, but a specific section to focus on may be useful.

I appreciate the help. Let me know if I can provide any additional information.

SessionInfo

> sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2 dplyr_0.7.4 

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16          utf8_1.1.3            crayon_1.3.4          assertthat_0.2.0      R6_2.2.2             
 [6] magrittr_1.5          pillar_1.2.1          cli_1.0.0             rlang_0.2.0.9001      rstudioapi_0.7.0-9000
[11] tools_3.4.4           glue_1.2.0            yaml_2.1.19           compiler_3.4.4        pkgconfig_2.0.1      
[16] bindr_0.1.1           tibble_1.4.2
  • What exactly are you trying to calculate here? Can you describe in words what the `persistence_rate` column should contain in the end? – MrFlick Jun 26 '18 at 19:48
  • The `persistence_rate` is the percentage of IDs in one period that were also present in the next period. – Christian Million Jun 26 '18 at 20:49

1 Answer


I think this does what you want in a more dplyr-friendly way.

persistence_new <- function(df, id, period, rank, ...){

  period <- enquo(period)
  id <- enquo(id)
  rank <- enquo(rank)
  group_var <- quos(...)

  df %>% group_by(!!id) %>%   
    arrange(!!rank) %>% 
    mutate(nextrank = lead(!!rank)) %>% 
    group_by(!!!group_var, !!period) %>% 
    summarize(persistence_rate=sum(nextrank == !!rank + 1, na.rm=TRUE)/n())

}

persistence_new(data, id, period, rank)
#   period persistence_rate
#   <chr>             <dbl>
# 1 a                 0.5  
# 2 b                 1    
# 3 c                 0.667
# 4 d                 0.667
# 5 e                 0  

Rather than doing the subquery-style join, here we just use lead() to check whether each id's next rank is one more than its current rank, and summarize based on that information. Since this uses only dplyr functions, they are easy to use with the bang-bang operator.
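To make the lead() step concrete, here is a minimal sketch on a made-up toy table (the names `toy`, `toy_steps`, and `persists` are illustrative assumptions, not part of the answer). It shows the intermediate `nextrank` column before any summarizing happens.

```r
library(dplyr)

# Hypothetical toy data: id "A" appears at ranks 1, 2, 3; id "B" skips rank 2.
toy <- tibble(
  id   = c("A", "B", "A", "A", "B"),
  rank = c(1, 1, 2, 3, 3)
)

toy_steps <- toy %>%
  group_by(id) %>%
  arrange(rank) %>%
  # Within each id, look one row ahead to find the next rank it appears in.
  mutate(
    nextrank = lead(rank),
    persists = !is.na(nextrank) & nextrank == rank + 1
  ) %>%
  ungroup() %>%
  arrange(id, rank)  # only to make the printed order deterministic

toy_steps
```

In this toy table, "A" persists out of ranks 1 and 2, while "B" does not persist out of rank 1 because its next appearance is at rank 3. Averaging a flag like `persists` per period is what the `sum(...)/n()` in the answer amounts to.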

Also, it seems like period and rank are basically the same thing here. You don't need to require rank as a parameter if you want to calculate it inside the function. For example:

persistence_new <- function(df, id, period, ...){

  period <- enquo(period)
  id <- enquo(id)
  group_var <- quos(...)

  df %>%   # use the df argument rather than the global `data`
    mutate(rank = group_indices(., !!period)) %>% 
    group_by(!!id) %>%   
    arrange(rank) %>% 
    mutate(nextrank = lead(rank)) %>% 
    group_by(!!!group_var, !!period) %>% 
    summarize(persistence_rate=sum(nextrank == rank + 1, na.rm=TRUE)/n())

}
persistence_new(data, id, period)
MrFlick
  • This code gives me the outcome that I am looking for. Thanks for the quick response! But do you happen to know how I might use tidyeval in tandem with Base R indexing similar to my example? I think it would be useful to know for future endeavors. When should I capture the value, when should I capture the expression, etc (As mentioned in the Vignette) and how? – Christian Million Jun 26 '18 at 20:48
  • Oh, and I was only able to get the above to work by changing your `data %>%` to `df %>%`. Just FYI. – Christian Million Jun 26 '18 at 21:10
  • Base R functions are unlikely to ever support tidyeval. But the part that was probably messed up was the `df$` part, which is missing from the new version. Calling out to the data.frame again outside the context of grouping was important for that solution (and is what made it non-dplyr friendly). But unquoting won't play nicely with `$`, thanks to the parser, I believe. It's best to avoid those types of expressions if you want to work with variables. – MrFlick Jun 26 '18 at 21:13
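
As a follow-up to the point about `df$`: in the question's second attempt, `id[rank == rank + 1]` is evaluated inside each period group, where the comparison can never be true, so every rate comes out as 0. One possible way to keep the base-R-style indexing is to extract the full, ungrouped columns first and index those. The sketch below assumes `pull()` with an unquoted quosure is acceptable for that purpose; the names `persistence_base`, `all_ids`, `all_ranks`, and `example_data` are hypothetical, and the data simply repeats the values from the question.

```r
library(dplyr)

# Same values as the question's data, rebuilt so the sketch is self-contained.
example_data <- tibble(
  id     = c("A", "B", "C", "D", "A", "C", "A", "B", "C",
             "A", "D", "C", "A", "B", "C"),
  period = c("a", "a", "a", "a", "b", "b", "c", "c", "c",
             "d", "d", "d", "e", "e", "e"),
  rank   = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5)
)

# Hypothetical function name; a sketch, not the answerer's method.
persistence_base <- function(df, id, period, rank, ...) {
  id <- enquo(id)
  period <- enquo(period)
  rank <- enquo(rank)
  group_var <- quos(...)

  # Capture the whole columns once, before grouping, as plain vectors.
  all_ids   <- pull(df, !!id)
  all_ranks <- pull(df, !!rank)

  df %>%
    group_by(!!!group_var, !!period) %>%
    summarise(
      # Compare this group's ids with the ids seen at the next rank overall.
      persistence_rate =
        length(intersect(!!id, all_ids[all_ranks == first(!!rank) + 1])) /
        n_distinct(!!id)
    )
}

result <- persistence_base(example_data, id, period, rank)
result
```

On the question's data this should reproduce the desired rates (0.5, 1, 0.667, 0.667, 0). Like the original `df$id` version, the lookup of next-rank ids ignores any extra grouping variables passed through `...`, which may or may not be what you want.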