I was about to ask a very similar question to this: basically, how to use `pmap` within `mutate` without having to use the variable names more than once.
Instead, I'll post it as an 'answer' here as it includes a reprex and a number of options that I've found, none of which are completely satisfactory to me.
Hopefully somebody else might be able to answer how to do it as required.
I often want to use `purrr::pmap` inside `dplyr::mutate` when working with a data.frame with list-columns.
Occasionally this involves a lot of repetition of variable names.
I'd like to be able to do this more succinctly, using an anonymous function so that the variables are only used once, when passed to `pmap`'s `.f` argument.
Take this small dataset as an example:
library('dplyr')
library('purrr')
df <- tribble(
  ~x, ~y, ~z,
  c(1), c(1,10), c(1, 10, 100),
  c(2), c(2,20), c(2, 20, 200),
)
Say the function I want to apply to each row is
func <- function(x, y, z){c(sum(x), sum(y), sum(z))}
In practice the function will be more complex, with lots of variables.
The function is only needed once, so I'd prefer not to have to name it explicitly and clog up my script and my working environment.
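As a quick sanity check of what every option below should produce, here is a minimal sketch that applies `func` to one row by hand and then to every row with `pmap`. It repeats the `df` and `func` definitions so that it runs on its own; `expected` is just an illustrative name, not something from the original problem.

```r
library('dplyr')
library('purrr')

df <- tribble(
  ~x, ~y, ~z,
  c(1), c(1, 10), c(1, 10, 100),
  c(2), c(2, 20), c(2, 20, 200)
)
func <- function(x, y, z){c(sum(x), sum(y), sum(z))}

# One row by hand: the sums of x, y and z for the first row.
func(x = 1, y = c(1, 10), z = c(1, 10, 100))

# Every row: pmap() matches the columns of df to func's arguments by name.
expected <- pmap(df, func)
```

Each element of `expected` is the length-3 vector that the `a` column should hold for the corresponding row.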
Here are the options. Each creates exactly the same data.frame but in a different way. The reason for including `avg` will become clear.
Note I'm not considering position matching using `..1`, `..2`, etc. as this is easy to mess up.
# Explicitly create a function for `.f`.
# This requires using the variable names (x, y, z) three times.
# It's completely clear what it's doing, but needs a lot of typing.
# It might sometimes fail - see https://github.com/tidyverse/purrr/issues/280
df_explicit <- df %>%
  mutate(
    avg = x - mean(x),
    a = pmap(.l = list(x, y, z), .f = function(x, y, z){ c(sum(x), sum(y), sum(z)) })
  )
# Pass the whole of `df` to `.l` and add `...` in an explicit function to deal with any unused columns.
# Variable names are used twice.
# `df` will have to be passed explicitly if not using pipes (eg, `mutate(.data = df, a = pmap(.l = df, ...`).
# This is probably inefficient for large datasets.
df_dots <- df %>%
  mutate(
    avg = x - mean(x),
    a = pmap(.l = ., .f = function(x, y, z, ...){ c(sum(x), sum(y), sum(z)) })
  )
# Use `pryr::f` (as discussed in https://stackoverflow.com/a/51123520/4269699).
# Variable names are used twice.
# Potentially unexpected behaviour.
# Not obvious to the casual reader why the extra `pryr::f` is needed and what it's doing.
df_pryrf <- df %>%
  mutate(
    avg = x - mean(x),
    a = pmap(.l = list(x,y,z), .f = pryr::f({c(sum(x), sum(y), sum(z))} ))
  )
# Use `rowwise()` similar to this: https://stackoverflow.com/a/47734073/4269699
# Variable names are used once.
# It will mess up any vectorised functions used elsewhere in mutate, hence the two `mutate()`s
df_rowwise <- df %>%
  mutate( avg = x - mean(x) ) %>%
  rowwise() %>%
  mutate( a = list( {c(sum(x), sum(y), sum(z))} ) ) %>%
  ungroup()
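To illustrate why `avg` has to be computed before `rowwise()`, here is a small sketch using the same `df` (repeated so it runs on its own). The names `avg_ok` and `avg_rowwise` are illustrative only. Under `rowwise()`, `mean(x)` is evaluated one row at a time, so `x - mean(x)` collapses to zero.

```r
library('dplyr')

df <- tribble(
  ~x, ~y, ~z,
  c(1), c(1, 10), c(1, 10, 100),
  c(2), c(2, 20), c(2, 20, 200)
)

# Ungrouped: mean(x) is the mean of the whole column (1.5).
avg_ok <- df %>%
  mutate(avg = x - mean(x)) %>%
  pull(avg)

# Rowwise: mean(x) is the mean of a single value, so avg is 0 on every row.
avg_rowwise <- df %>%
  rowwise() %>%
  mutate(avg = x - mean(x)) %>%
  ungroup() %>%
  pull(avg)
```

This is the reason the `rowwise()` option needs two separate `mutate()` calls for this particular problem.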
# Use Romain Francois' neat {rap} package.
# Variable names used once.
# Like `rowwise()` it will mess up any vectorised functions so it needs two `mutate()`s for this particular problem.
library('rap') # devtools::install_github("romainfrancois/rap")
df_rap <- df %>%
  mutate( avg = x - mean(x) ) %>%
  rap( a = ~ c(sum(x), sum(y), sum(z)) )
# Another solution discussed here https://stackoverflow.com/a/51123520/4269699 doesn't seem to work inside `mutate()`, but maybe could be tweaked?
# Like the `pryr::f` solution, it's not immediately obvious what the purpose of the `with(list(...` bit is.
df_with <- df %>%
  mutate(
    avg = x-mean(x),
    a = pmap(.l = list(x,y,z), .f = ~with(list(...), { c(sum(x), sum(y), sum(z))} ))
  )
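One possible tweak, offered as a hedged sketch rather than a definitive fix: as far as I can tell, `list(...)` inside the lambda only carries names when the elements of `.l` are themselves named, so naming them lets `with()` find the per-row values. This repeats each variable name inside `.l`, so it does not meet the goal of using each name only once. `df_with_named` is an illustrative name.

```r
library('dplyr')
library('purrr')

df <- tribble(
  ~x, ~y, ~z,
  c(1), c(1, 10), c(1, 10, 100),
  c(2), c(2, 20), c(2, 20, 200)
)

df_with_named <- df %>%
  mutate(
    avg = x - mean(x),
    # Naming the elements of .l makes list(...) a named list inside the lambda,
    # so with() can look up x, y and z for the current row.
    a = pmap(
      .l = list(x = x, y = y, z = z),
      .f = ~with(list(...), { c(sum(x), sum(y), sum(z)) })
    )
  )
```

The purpose of the `with(list(...` part is still not obvious to a casual reader, so the readability concern above remains.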
As far as I know these are the options, excluding position matching.
Ideally, something like the following would be possible, where the function `qmap` knows to find (rowwise) variables `x`, `y`, and `z` from the object passed to `mutate`'s `.data` argument.
df_new <- df %>%
  mutate(
    avg = x-mean(x),
    a = qmap( ~c(sum(x), sum(y), sum(z)) )
  )
But I don't know how to do this, so consider this only a partial answer.
Related issues: