I am trying to use summarise and group_by from dplyr in R. However, when I pass the summarised column through a variable instead of naming it explicitly, every row gets the sum of dist for the entire data set rather than the sum for its group. The difference between TestBad and TestGood below shows the problem. I want to reproduce TestGood's results while using the GraphVar variable, as in TestBad.

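    library(dplyr)
    GraphVar <- "dist"

    # Column name taken from a variable: Sum is identical for every group
    TestBad <- summarise(group_by_(cars, "speed"), Sum = sum(cars[[GraphVar]], na.rm = TRUE), Count = n())

    # Column named explicitly: Sum is computed per group, as intended
    TestGood <- summarise(group_by_(cars, "speed"), Sum = sum(dist, na.rm = TRUE), Count = n())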

Thanks!

Urza5589
  • You'll need the standard evaluation functions from dplyr. See an example [here](http://stackoverflow.com/questions/27975124/pass-arguments-to-dplyr-functions) and the [nse vignette here](https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html) – aosmith Aug 31 '16 at 14:36
  • @aosmith They're already using standard evaluation (`group_by_`) and are having trouble with it, I reckon. – Frank Aug 31 '16 at 14:36

2 Answers

As of February 2020, the rlang package provides tidyeval tools for this. In particular, if the column names are strings, you can use the .data pronoun.

    library(dplyr)
    GraphVar <- "dist"
    cars %>%
        group_by(.data[["speed"]]) %>%
        summarise(Sum = sum(.data[[GraphVar]], na.rm = TRUE),
                  Count = n())
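To illustrate why the original attempt returns the same total in every row, here is a minimal sketch, assuming dplyr 1.0.0 or later (for the `.groups` argument). `cars[[GraphVar]]` extracts the whole column from the ungrouped data frame, so grouping has no effect on it, whereas `.data[[GraphVar]]` is evaluated within each group. The names `Bad` and `Good` are introduced here only for illustration.

```r
library(dplyr)

GraphVar <- "dist"

# cars[[GraphVar]] is the full, ungrouped column, so every group gets the grand total
Bad <- cars %>%
    group_by(speed) %>%
    summarise(Sum = sum(cars[[GraphVar]], na.rm = TRUE), .groups = "drop")

# .data[[GraphVar]] refers to the column slice belonging to the current group
Good <- cars %>%
    group_by(speed) %>%
    summarise(Sum = sum(.data[[GraphVar]], na.rm = TRUE), .groups = "drop")

# Every row of Bad holds the overall sum of dist
all(Bad$Sum == sum(cars$dist))

# The per-group sums in Good add up to that same overall sum
sum(Good$Sum) == sum(cars$dist)
```

Both comparisons should come out TRUE, which matches the behaviour described in the question.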

The scoped helper *_at() functions are also useful when working with strings, although they will be superseded (but not deprecated) in dplyr 1.0.0.

    cars %>%
        group_by_at("speed") %>%
        summarise_at(.vars = vars(GraphVar),
                     .funs = list(Sum = ~sum(., na.rm = TRUE),
                                  Count = ~n()))

In 2016 you needed the standard evaluation function summarise_() along with lazyeval::interp(). This still works in 2020 but has been deprecated.

    library(lazyeval)
    cars %>%
        group_by_("speed") %>%
        summarise_(Sum = interp(~sum(var, na.rm = TRUE), var = as.name(GraphVar)),
                   Count = ~n())
aosmith
  • this usage is deprecated – user680111 Feb 22 '20 at 18:12
  • @user680111 Yes, this answer is from 2016, which predates the current tidyeval approach. Was the downvote to ask for an updated answer or something else? – aosmith Feb 25 '20 at 15:12
  • yeah - update would be appreciated. Most of the solutions for dynamic variable selection in dplyr correspond to obsolete usage – user680111 Feb 26 '20 at 12:32
  • @user680111 I updated yesterday. It's actually interesting that the old way, while deprecated, still works. – aosmith Feb 26 '20 at 15:26
  • How do you use the .data pronoun for more than one variable? – Indranil Gayen Apr 27 '20 at 12:55
  • @IndranilGayen I haven't looked into it much, but I don't *think* it's meant to be used with multiple variables at once. For example, for two grouping variables you'd do `cars %>% group_by(.data[["speed"]], .data[["dist"]])`. If you want to pass a vector of variable names as strings you may end up back with `rlang::syms()`. Or `group_by_at()`. Since all the scoped functions are to be superseded in dplyr 1.0.0 I would think you could use `across()` but don't see how it works with `group_by()` yet. – aosmith Apr 27 '20 at 14:14

The latest usage for referring to one or more columns by name seems to be

    cars %>% group_by(across("speed")) %>% ...
    cars %>% group_by(across(c("speed", "dist"))) %>% ...

See vignette("colwise"), section Other verbs.
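As a rough end-to-end sketch of that approach applied to the original question, assuming dplyr 1.0.0 or later: the grouping columns are held in a character vector and wrapped in `all_of()`, while the summarised column is looked up with the `.data` pronoun. `GroupVars`, `Result`, and `Expected` are hypothetical names used only for this example.

```r
library(dplyr)

# Column names held as strings (GroupVars is a hypothetical name for this sketch)
GroupVars <- c("speed")
GraphVar  <- "dist"

# Group by the columns named in GroupVars, then summarise the column named in GraphVar
Result <- cars %>%
    group_by(across(all_of(GroupVars))) %>%
    summarise(Sum = sum(.data[[GraphVar]], na.rm = TRUE),
              Count = n(),
              .groups = "drop")

# The same summary with the columns written out explicitly, for comparison
Expected <- cars %>%
    group_by(speed) %>%
    summarise(Sum = sum(dist, na.rm = TRUE),
              Count = n(),
              .groups = "drop")

all.equal(Result, Expected)
```

Wrapping the character vector in `all_of()` makes it explicit that the names come from a variable rather than from columns of the data, and adding more names to `GroupVars` should extend the grouping without changing the rest of the pipeline.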

James Baye