
My question builds on a similar one by imposing an additional constraint that the name of each variable should appear only once.

Consider a data frame

library( tidyverse )
df <- tibble( potentially_long_name_i_dont_want_to_type_twice = 1:10,
              another_annoyingly_long_name = 21:30 )

I would like to apply mean to the first column and sum to the second column, without unnecessarily typing each column name twice.

As the question I linked above shows, summarize allows you to do this, but requires that the name of each column appears twice. On the other hand, summarize_at allows you to succinctly apply multiple functions to multiple columns, but it does so by calling all specified functions on all specified columns, instead of doing it in a one-to-one fashion. Is there a way to combine these distinct features of summarize and summarize_at?
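
To make the contrast concrete, here is a small sketch of the two existing behaviours. The object names one_to_one and all_by_all are made up for illustration; everything else is the data frame defined above.

```r
library(dplyr)

df <- tibble(
  potentially_long_name_i_dont_want_to_type_twice = 1:10,
  another_annoyingly_long_name = 21:30
)

# summarize(): one function per column, but every name is typed twice
one_to_one <- df %>%
  summarize(
    potentially_long_name_i_dont_want_to_type_twice =
      mean(potentially_long_name_i_dont_want_to_type_twice),
    another_annoyingly_long_name = sum(another_annoyingly_long_name)
  )

# summarize_at(): every name is typed once, but each function is applied
# to each column, so two columns and two functions give four results
all_by_all <- df %>%
  summarize_at(
    vars(potentially_long_name_i_dont_want_to_type_twice,
         another_annoyingly_long_name),
    list(mean = mean, sum = sum)
  )

ncol(one_to_one)  # 2
ncol(all_by_all)  # 4
```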

I was able to hack it with rlang, but I'm not sure if it's any cleaner than just typing each variable twice:

v <- c("potentially_long_name_i_dont_want_to_type_twice",
       "another_annoyingly_long_name")
f <- list(mean,sum)

## Desired output
smrz <- set_names(v) %>% map(sym) %>% map2( f, ~rlang::call2(.y,.x) )
df %>% summarize( !!!smrz )
# # A tibble: 1 x 2
#   potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                             <dbl>                        <int>
# 1                                             5.5                          255

EDIT to address some philosophical points

I don’t think that wanting to avoid the x=f(x) idiom is unreasonable. I probably came across a bit overzealous about typing long names, but the real issue is actually having (relatively) long names that are very similar to each other. Examples include nucleotide sequences (e.g., AGCCAGCGGAAACAGTAAGG) and TCGA barcodes. Not only is autocomplete of limited utility in such cases, but writing things like AGCCAGCGGAAACAGTAAGG = sum( AGCCAGCGGAAACAGTAAGG ) introduces unnecessary coupling and increases the risk that the two sides of the assignment might accidentally go out of sync as the code is developed and maintained.

I completely agree with @MrFlick about dplyr increasing code readability, but I don’t think that readability should come at the cost of correctness. Functions like summarize_at and mutate_at are brilliant, because they strike a perfect balance between placing operations next to their operands (clarity) and guaranteeing that the result is written to the correct column (correctness).

By the same token, I feel that the proposed solutions which remove variable mention altogether swing too far in the other direction. While inherently clever -- and I certainly appreciate the extra typing they save -- I think that, by removing the association between functions and variable names, such solutions now rely on proper ordering of variables, which creates its own risks of accidental errors.

In short, I believe that a self-mutating / self-summarizing operation should mention each variable name exactly once.
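
For what it's worth, dplyr 1.0.0 and later (released after this question was asked) provide across(), which seems to meet this requirement: each name appears once, right next to the function applied to it. A minimal sketch, assuming dplyr >= 1.0.0 is installed; the name res is only for illustration.

```r
library(dplyr)  # across() requires dplyr >= 1.0.0

df <- tibble(
  potentially_long_name_i_dont_want_to_type_twice = 1:10,
  another_annoyingly_long_name = 21:30
)

# With a single function and no .names argument, across() keeps the
# original column name, so nothing has to be typed twice
res <- df %>%
  summarize(
    across(potentially_long_name_i_dont_want_to_type_twice, mean),
    across(another_annoyingly_long_name, sum)
  )
```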

Artem Sokolov
  • Basically a duplicate of this question: https://stackoverflow.com/questions/36822672/summarize-different-columns-with-different-functions Keep in mind that one of the great benefits of using dplyr is that it makes your code easier to read. Being this averse to typing column names seems odd. Are you using RStudio's autocomplete features? Is this really a problem? – MrFlick Apr 11 '19 at 20:57
  • I agree with @MrFlick, but if you really are that averse to typing the names, you could create the `sym`'s up front from `v` and then just do a regular `summarise()` referring to everything via `!!` and the resulting list of `sym`'s. – joran Apr 11 '19 at 21:01

4 Answers


I propose two tricks to solve this issue; the code and some details for both are at the bottom:

A function .at that returns results for groups of variables (here only one variable per group), which we can then splice in with !!!, so we get the best of both worlds, summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), mean),
  !!!.at(vars(another_annoyingly_long_name), sum))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <dbl>
#   1                                             5.5                          255

An adverb to summarize, with a dollar notation shorthand.

df %>%
  ..flx$summarize(potentially_long_name_i_dont_want_to_type_twice = ~mean(.),
                  another_annoyingly_long_name = ~sum(.))

# # A tibble: 1 x 2
#     potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                               <dbl>                        <int>
#   1                                             5.5                          255

code for .at

It has to be used in a pipe because it relies on the . in the parent environment; messy, but it works.

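.at <- function(.vars, .funs, ...) {
  # check that the calling frame contains `.` and nothing else,
  # which is what this function expects to find inside a pipe step
  in_a_piped_fun <- exists(".",parent.frame()) &&
    length(ls(envir=parent.frame(), all.names = TRUE)) == 1
  if (!in_a_piped_fun)
    stop(".at() must be called as an argument to a piped function")
  # fetch the piped data frame from the calling frame
  .tbl <- try(eval.parent(quote(.)))
  # manip_at() is unexported, so this may break with other dplyr versions
  dplyr:::manip_at(
    .tbl, .vars, .funs, rlang::enquo(.funs), rlang::caller_env(),
    .include_group_vars = TRUE, ...)
}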

I designed it to combine summarize and summarize_at:

df %>% summarize(
  !!!.at(vars(potentially_long_name_i_dont_want_to_type_twice), list(foo=min, bar = max)),
  !!!.at(vars(another_annoyingly_long_name), median))

# # A tibble: 1 x 3
#       foo   bar another_annoyingly_long_name
#     <dbl> <dbl>                        <dbl>
#   1     1    10                         25.5

code for ..flx

..flx outputs a function that, before running, replaces each formula argument such as a = ~mean(.) with a call such as a = purrr::as_mapper(~mean(.))(a). This is convenient with summarize and mutate: a column cannot be a formula, so there can't be any conflict.
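
As a rough illustration of that rewrite (not part of the original code), the sketch below shows what purrr::as_mapper does with a formula. The names a, mean_of, via_formula and typed_twice are made up for the example.

```r
library(purrr)

a <- 1:10

# A one-sided formula is not callable on its own; as_mapper() turns it
# into a function whose first argument is available as `.`
mean_of <- as_mapper(~ mean(.))

via_formula <- mean_of(a)  # the shape that `a = ~mean(.)` is rewritten to
typed_twice <- mean(a)     # the usual a = mean(a) idiom

identical(via_formula, typed_twice)
```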

I like to use the dollar notation as a shorthand and to have names starting with .. so I can name those "tags" (and give them a class "tag") and see them as different objects (still experimenting with this). ..flx(summarize)(...) will work as well though.

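..flx <- function(fun){
  function(...){
    mc <- match.call()
    # keep only the wrapped function, e.g. `..flx$summarize` becomes `summarize`
    mc[[1]] <- tail(mc[[1]],1)[[1]]
    # rewrite each formula argument `name = ~f(.)` as `name = as_mapper(~f(.))(name)`
    mc[] <- imap(mc,~if(is.call(.) && identical(.[[1]],quote(`~`))) {
      rlang::expr(purrr::as_mapper(!!.)(!!sym(.y)))
    } else .)
    # evaluate the rewritten call where the original call was made
    eval.parent(mc)
  }
}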

class(..flx) <- "tag"

`$.tag` <- function(e1, e2){
  # change original call so x$y, which is `$.tag`(tag=x, data=y), becomes x(y)
  mc <- match.call()
  mc[[1]] <- mc[[2]]
  mc[[2]] <- NULL
  names(mc) <- NULL
  # evaluate it in parent env
  eval.parent(mc)
}
moodymudskipper
  • Thank you, @Moody. I appreciate all the effort you put into this answer. Your adverb / tag system looks quite intriguing. Do you have a package or some text that formalizes this? I'm curious to learn more. – Artem Sokolov Apr 12 '19 at 15:30
  • I think adverbs are underused: the concept of a function returning a function is not trivial, and it also adds brackets. A "tag" (still not sure about the name) is more intuitive: you stick it to your function and it changes its behavior, here by preprocessing the arguments, but it could also be a side effect or an additional argument. Tags allow you to develop in a single function features for many functions; for instance, my solution works using `transform` as well. @G-Grothendieck's `gsubfn::fn` is pretty much another example of a tag (and the only one I've seen). – moodymudskipper Apr 12 '19 at 16:30
  • The system is really young and I'm still figuring it out, but I'll take this opportunity to formalize it into a package and write a draft doc on the philosophical points this weekend and ping you here. – moodymudskipper Apr 12 '19 at 16:37
  • Thank you, @Moody. I've been using `purrr::partial` and `purrr::compose` quite extensively (probably more than I care to admit), but I've always felt that all the extra brackets made things messy. I wonder if this would be a great setting for adverbs. – Artem Sokolov Apr 12 '19 at 17:09
  • Definitely, *magrittr*'s functional sequences take all the features from `compose` and `partial` without the brackets, but the pipe symbols are bulky and starting from a `.` is awkward, so I wrapped it into a tag `..fs` and I can do `fun <- ..fs$substr(1,3)$toupper$paste0("-",.); fun("hello")` and it'll return `"-HEL"` – moodymudskipper Apr 12 '19 at 18:10
  • @ArtemSokolov you'll find my package *tags* here: https://github.com/moodymudskipper/tags along with a readme that showcases the features, plus some reflections. If you feel like sharing your thoughts about it, don't hesitate to open an issue, as I'd love to hear them. – moodymudskipper Apr 15 '19 at 23:23
  • Got it, @Moody_Mudskipper. Lots of interesting ideas there! I'll post my thoughts later today or tomorrow. – Artem Sokolov Apr 16 '19 at 14:45
  • awesome, it is discussed a bit here as well : https://community.rstudio.com/t/classes-attributes-of-functions/28430 – moodymudskipper Apr 16 '19 at 15:20

Use .[[i]] and !!names(.)[i]:= to refer to the ith column and its name.

library(tibble)
library(dplyr)
library(rlang)

df %>% summarize(!!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]])) 

giving:

# A tibble: 1 x 2
  potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
                                            <dbl>                        <int>
1                                             5.5                          255

Update

If df were grouped (it is not in the question so this is not needed) then surround summarize with a do like this:

library(dplyr)
library(rlang)
library(tibble)

df2 <- tibble(a = 1:10, b = 11:20, g = rep(1:2, each = 5))

df2 %>%
  group_by(g) %>%
  do(summarize(., !!names(.)[1] := mean(.[[1]]), !!names(.)[2] := sum(.[[2]]))) %>%
  ungroup

giving:

# A tibble: 2 x 3
      g     a     b
  <int> <dbl> <int>
1     1     3    65
2     2     8    90
G. Grothendieck
  • Note that if the tibble is grouped the groups will be ignored with this solution – moodymudskipper Apr 11 '19 at 22:23
  • `df` in the question is not grouped. Clearly if the problem changes then you need to change the answer appropriately. In the case of grouped data one needs to surround `summarize` with `do`. See the Update. – G. Grothendieck Apr 11 '19 at 22:45
  • Indeed, the given example is minimal, the wider issue of using summarize with long variable names though applies frequently to grouped data frames so your edit is useful. – moodymudskipper Apr 11 '19 at 23:03

Here's a hacky function that relies on unexported functions from dplyr, so it is not future-proof, but it lets you specify a different summary for each column.

summarise_with <- function(.tbl, .funs) {
  funs <- enquo(.funs)
  syms <- syms(tbl_vars(.tbl))
  calls <- dplyr:::as_fun_list(.funs, funs, rlang::caller_env())
  stopifnot(length(syms)==length(calls))
  cols <- purrr::map2(calls, syms, ~dplyr:::expr_substitute(.x, quote(.), .y))
  cols <- purrr::set_names(cols, purrr::map_chr(syms, rlang::as_string))
  summarize(.tbl, !!!cols)
}

Then you could do

df %>% summarise_with(list(mean, sum))

and not have to type the column names at all.
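
If depending on unexported dplyr functions is a concern, a similar positional pairing can probably be built from exported functions only. The sketch below is an assumption-laden variant, not the function above: summarise_pairwise is a hypothetical name, it only accepts a plain list of functions, and it assumes an ungrouped data frame.

```r
library(dplyr)
library(purrr)
library(rlang)

# Hypothetical helper (not part of dplyr): pairs the i-th column
# with the i-th function, using exported functions only
summarise_pairwise <- function(.tbl, .funs) {
  cols <- names(.tbl)
  stopifnot(length(cols) == length(.funs))
  # build one call per column, e.g. (mean)(first_column)
  calls <- map2(.funs, syms(cols), ~ call2(.x, .y))
  # name each call after its column and splice them into summarize()
  summarize(.tbl, !!!set_names(calls, cols))
}

df <- tibble(
  potentially_long_name_i_dont_want_to_type_twice = 1:10,
  another_annoyingly_long_name = 21:30
)

res <- df %>% summarise_pairwise(list(mean, sum))
```

As with the function above, the result depends on the order of the columns matching the order of the functions.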

MrFlick

It seems like you can use map2 for this.

map2_dfc( df[v], f, ~.y(.x))

# # A tibble: 1 x 2
#   potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
#                                             <dbl>                        <int>
# 1                                             5.5                          255
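
A possible variation, if relying on the order of v and f feels fragile: store the functions in a list named by column, so each name appears once and stays attached to its function. This is only a sketch; the names f_named and res are made up, and like map2_dfc above it ignores any grouping.

```r
library(dplyr)
library(purrr)

df <- tibble(
  potentially_long_name_i_dont_want_to_type_twice = 1:10,
  another_annoyingly_long_name = 21:30
)

# each column name appears once, as the name of the function applied to it
f_named <- list(
  potentially_long_name_i_dont_want_to_type_twice = mean,
  another_annoyingly_long_name = sum
)

# in imap(), .x is the function and .y is its name, i.e. the column name
res <- imap(f_named, ~ .x(df[[.y]])) %>% as_tibble()
```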
IceCreamToucan