R: purrr: using pmap for row-wise operations, but this time involving LOTS of columns

Question

This is not a duplicate of questions such as "Row-wise iteration like apply with purrr".

I understand how to use pmap() to do a row-wise operation on a data-frame:

library(tidyverse)

df1 = tribble(~col_1, ~col_2, ~col_3,
               1,      5,      12,
               9,      3,      3,
               6,     10,     7)

foo = function(col_1, col_2, col_3) {
  mean(c(col_1, col_2, col_3))
}

df1 %>% pmap_dbl(foo)

This gives the function foo applied to every row:

[1] 6.000000 5.000000 7.666667

But this gets pretty unwieldy when I have more than a few columns, because I have to pass them all in explicitly. What if I had, say, 8 columns in my data frame df2 and wanted to apply a function bar that potentially involves every single one of those columns?

set.seed(12345)
df2 = rnorm(n=24) %>% matrix(nrow=3) %>% as_tibble() %>%
  setNames(c("col_1", "col_2", "col_3", "col_4", "col_5", "col_6", "col_7", "col_8"))

bar = function(col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8) {
  # imagine we do some complicated row-wise operation here
  mean(c(col_1, col_2, col_3, col_4, col_5, col_6, col_7, col_8))
}

df2 %>% pmap_dbl(bar)

Gives:

[1]  0.45085420  0.02639697 -0.28121651

This is clearly inadequate -- I have to add a new argument to bar for every single column. It's a lot of typing, and it makes the code less readable and more fragile. It seems like there should be a way to have it take a single argument x, and then access the variables I want by x$col_1 etc. Or something more elegant than the above at any rate. Is there any way to clean this code up using purrr?

I've found some of this depends on whether or not the order of the columns is in the order the arguments go into the function or if you are working with all columns or only some of them. For your simple example, where you are using all the columns, you could do something like `bar2 = function(...) mean(c(...))` — aosmith, Aug 05 '19 at 17:29
@aosmith That's just an artefact of the example I chose -- in general I will need to be able to refer to the columns by name and use them in different ways, rather than using `...` as a catchall. — dain, Aug 05 '19 at 17:33

score 4 · Accepted Answer · answered Aug 05 '19 at 17:50

You can accept the columns through ... and collect them into a list with list(...) once they're inside your function.

dot_tester <- function(...) {
  dots <- list(...)
  dots$Sepal.Length + dots$Petal.Width
}

purrr::pmap(head(iris), dot_tester)

[[1]]
[1] 5.3

[[2]]
[1] 5.1

[[3]]
[1] 4.9

[[4]]
[1] 4.8

[[5]]
[1] 5.2

[[6]]
[1] 5.8

However, this doesn't make your code any less "fragile", since the names you use inside the function still have to match the column names exactly. The benefit is that you no longer have to list every column in the function() signature.
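As a minimal sketch of the same idea with a numeric result, the example below uses pmap_dbl() on a small made-up data frame (df, by_dots and by_name are hypothetical names, not from the question). It also shows a related variant: name only the columns the function needs as arguments and let ... absorb the rest. This relies on pmap() matching the elements of a named list to arguments by name.

```r
library(purrr)

# Hypothetical data frame with more columns than the function uses
df <- data.frame(
  col_1 = c(1, 9, 6),
  col_2 = c(5, 3, 10),
  col_3 = c(12, 3, 7),
  col_4 = c(0, 0, 0)
)

# Variant 1: capture every column in `...` and look values up by name
by_dots <- function(...) {
  row <- list(...)
  row$col_1 * 2 + row$col_3
}

# Variant 2: name only the needed columns; `...` swallows the others
by_name <- function(col_1, col_3, ...) {
  col_1 * 2 + col_3
}

res_dots <- pmap_dbl(df, by_dots)
res_name <- pmap_dbl(df, by_name)

print(res_dots)
print(res_name)
```

Both variants break in the same way if a column is renamed, so neither removes the dependence on column names; the second just keeps the function body shorter.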

score 1 · Answer 2 · answered Aug 05 '19 at 17:29


The easiest (though probably not the safest) way I could think of is to use the ... argument, which accepts any number of columns:

library(tidyverse)

set.seed(12345)
df2  <-  rnorm(n=24) %>% matrix(nrow=3) %>% as_tibble() %>%
  setNames(c("col_1", "col_2", "col_3", "col_4", "col_5", "col_6", "col_7", "col_8"))
#> Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
#> This warning is displayed once per session.

bar <- function(...){
  mean(c(...))
}
df2 %>% pmap_dbl(bar)
#> [1]  0.45085420  0.02639697 -0.28121651

Created on 2019-08-05 by the reprex package (v0.3.0)

Benjamin Schwetz


I need to be able to refer to the columns by name though. My actual use-case isn't as simple as calling `mean()`, that's just what I picked to keep the example simple. – dain Aug 05 '19 at 17:34
I see. So changing the body of bar is not allowed? – Benjamin Schwetz Aug 05 '19 at 17:47
The body of bar can be modified ... e.g. references to `col_i` might become `x[col_i]` or something. – dain Aug 05 '19 at 17:52

score 1 · Answer 3 · answered Aug 05 '19 at 18:46

@Brian's answer works, but I also found another method using purrr::transpose that lets me use a single named variable x rather than ..., and can access any of the columns by name:

foo = function(x) {
  (x$col_1 + x$col_2 + x$col_3)/3
}

df1 %>% transpose() %>% map_dbl(foo)

This gives the correct answer:

[1] 6.000000 5.000000 7.666667
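To see why this works, it can help to inspect what transpose() returns. The sketch below rebuilds df1 as a plain data frame (rows and row_mean are hypothetical names) and shows that each element of the transposed result is a named list holding one row, which is why x$col_1 works inside the function. As far as I know, transpose() is marked as superseded in purrr 1.0.0 and later but still works; treat that as something to check against your installed version.

```r
library(purrr)

df1 <- data.frame(
  col_1 = c(1, 9, 6),
  col_2 = c(5, 3, 10),
  col_3 = c(12, 3, 7)
)

# Turn the list of columns into a list of rows
rows <- transpose(df1)

# Each element is a named list with one value per column
str(rows[[1]])

# A single-argument function can then pick columns by name
row_mean <- function(x) (x$col_1 + x$col_2 + x$col_3) / 3
means <- map_dbl(rows, row_mean)
print(means)
```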

As for the other dataframe:

set.seed(12345)
df2 = rnorm(n=24) %>% matrix(nrow=3) %>% as_tibble() %>%
  setNames(c("col_1", "col_2", "col_3", "col_4", "col_5", "col_6", "col_7", "col_8"))

bar = function(x) {
  mean(as.double(x))
}

df2 %>% transpose() %>% map_dbl(bar)

Gives:

[1]  0.45085420  0.02639697 -0.28121651

But I can also do this by referring to individual columns:

bar_2 = function(x) {
  x$col_2 + x$col_5 / x$col_3
}

df2 %>% transpose() %>% map_dbl(bar_2)

[1]  0.1347090 -1.2776983  0.8232767

I realise these particular examples could easily be accomplished with mutate, but for times when a real row-wise iteration is called for, I think this works well enough.
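For comparison, here is a sketch of the mutate version of a formula like the one in bar_2. Because arithmetic in R is vectorised, the expression is evaluated on whole columns at once and no row-wise iteration is needed. The data below is made up for illustration (it is not the df2 from the question), and out and result are hypothetical names.

```r
library(dplyr)

# Made-up values, chosen so the arithmetic is easy to check by hand
df <- tibble(
  col_2 = c(5, 3, 10),
  col_3 = c(2, 3, 5),
  col_5 = c(4, 9, 10)
)

# Division binds tighter than addition: col_2 + (col_5 / col_3)
out <- df %>% mutate(result = col_2 + col_5 / col_3)
print(out)
```

The transpose() or pmap() approaches remain useful when the per-row function is not vectorised, for example when it branches on the values in the row.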
