My question builds on a similar one by imposing an additional constraint that the name of each variable should appear only once.
Consider a data frame
library( tidyverse )
df <- tibble( potentially_long_name_i_dont_want_to_type_twice = 1:10,
another_annoyingly_long_name = 21:30 )
I would like to apply mean to the first column and sum to the second column, without unnecessarily typing each column name twice.
As the question I linked above shows, summarize allows you to do this, but requires that the name of each column appears twice. On the other hand, summarize_at allows you to succinctly apply multiple functions to multiple columns, but it does so by calling all specified functions on all specified columns, instead of doing it in a one-to-one fashion. Is there a way to combine these distinct features of summarize and summarize_at?
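To illustrate the all-by-all behavior of summarize_at, here is a minimal sketch on a stand-in data frame. The names df_demo, a, b, and crossed are hypothetical and used only to keep the example short; they are not part of my real data.

```r
library(tidyverse)

# Hypothetical stand-in data with short column names
df_demo <- tibble(a = 1:10, b = 21:30)

# summarize_at() applies every listed function to every selected column,
# so two columns and two functions yield four result columns
crossed <- df_demo %>%
  summarize_at(vars(a, b), list(mean = mean, sum = sum))

names(crossed)
```

What I am after is only the two "diagonal" results (mean of the first column, sum of the second), not the full cross product.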
I was able to hack it with rlang, but I'm not sure if it's any cleaner than just typing each variable twice:
v <- c("potentially_long_name_i_dont_want_to_type_twice",
       "another_annoyingly_long_name")
f <- list(mean, sum)
## Build one unevaluated call per (column, function) pair, named after the column
smrz <- set_names(v) %>% map(sym) %>% map2( f, ~rlang::call2(.y,.x) )
## Desired output
df %>% summarize( !!!smrz )
# # A tibble: 1 x 2
# potentially_long_name_i_dont_want_to_type_twice another_annoyingly_long_name
# <dbl> <int>
# 1 5.5 255
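A possible variation on the same hack, sketched under the assumption that a named list is an acceptable way to pair each column with its function. The name fns is hypothetical; the point is that each column name is written once, right next to the function meant for it, so the pairing does not depend on two parallel vectors staying in the same order.

```r
library(tidyverse)

df <- tibble(potentially_long_name_i_dont_want_to_type_twice = 1:10,
             another_annoyingly_long_name = 21:30)

# Hypothetical named list: each column name appears once, beside its function
fns <- list(potentially_long_name_i_dont_want_to_type_twice = mean,
            another_annoyingly_long_name = sum)

# imap() passes each function as .x and its list name as .y;
# call2() builds the unevaluated call f(column) for that pair
smrz <- imap(fns, ~ rlang::call2(.x, rlang::sym(.y)))

# Splice the named calls into summarize(), as in the original hack
res <- df %>% summarize(!!!smrz)
res
```

This still goes through rlang, so I'm not sure it counts as cleaner, but it does keep the association between names and functions explicit.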
EDIT to address some philosophical points
I don’t think that wanting to avoid the x=f(x) idiom is unreasonable. I probably came across as a bit overzealous about typing long names, but the real issue is actually having (relatively) long names that are very similar to each other. Examples include nucleotide sequences (e.g., AGCCAGCGGAAACAGTAAGG) and TCGA barcodes. Not only is autocomplete of limited utility in such cases, but writing things like AGCCAGCGGAAACAGTAAGG = sum( AGCCAGCGGAAACAGTAAGG ) introduces unnecessary coupling and increases the risk that the two sides of the assignment might accidentally go out of sync as the code is developed and maintained.
I completely agree with @MrFlick about dplyr increasing code readability, but I don’t think that readability should come at the cost of correctness. Functions like summarize_at and mutate_at are brilliant, because they strike a perfect balance between placing operations next to their operands (clarity) and guaranteeing that the result is written to the correct column (correctness).
By the same token, I feel that the proposed solutions which remove variable mention altogether swing too far in the other direction. While inherently clever -- and I certainly appreciate the extra typing they save -- I think that, by removing the association between functions and variable names, such solutions now rely on proper ordering of variables, which creates its own risks of accidental errors.
In short, I believe that a self-mutating / self-summarizing operation should mention each variable name exactly once.