3

I would like to use dplyr::mutate to add two named columns to a data frame simultaneously, with a single function call. Consider the following example:

library(dplyr)

n <- 1e2; M <- 1e3
variance <- 1

x <- rnorm(n*M, 0, sqrt(variance))  # rnorm() takes the standard deviation, not the variance
s <- rep(1:M, each = n)

dat <- data.frame(s = s, x = x)

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
}

ci_studclt(x)

Something like the following returns an error, since the function produces two values and these cannot be placed in a single atomic column:

dat %>% 
  group_by(s) %>% 
  mutate(ci = ci_studclt(x, variance))
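
The failure can be reproduced on a small scale. The sketch below assumes a toy data set (3 groups of 10 values, arbitrary seed, both made up for illustration) and captures the error instead of stopping; as far as I understand it, mutate() expects each result to have length 1 or the size of the current group, and a length-2 vector is neither.

```r
library(dplyr)

# Toy data: 3 groups of 10 observations (sizes chosen for illustration only)
set.seed(1)
toy <- data.frame(s = rep(1:3, each = 10), x = rnorm(30))

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n - 1) * sqrt(S_n / n)
}

# The function returns two values per call
ci_len <- length(ci_studclt(toy$x))

# Capture the error rather than aborting the script
err <- tryCatch(
  toy %>% group_by(s) %>% mutate(ci = ci_studclt(x)),
  error = function(e) e
)

inherits(err, "error")
```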

It seems one option is to insert a list column and then call unnest_wider, and that this is easier with data.table or in the specific case of splitting a string column into two new columns.
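
For reference, the list-column route mentioned here might look roughly like the sketch below. It assumes tidyr is installed and uses a small toy data set; the column names ci_lower and ci_upper are attached with setNames() so that unnest_wider() can use them as the new column names.

```r
library(dplyr)
library(tidyr)

# Toy data: 3 groups of 10 observations (illustrative sizes only)
set.seed(1)
toy <- data.frame(s = rep(1:3, each = 10), x = rnorm(30))

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n - 1) * sqrt(S_n / n)
}

res_list <- toy %>%
  group_by(s) %>%
  # one function call per group; the length-1 list is recycled within the group
  mutate(ci = list(setNames(ci_studclt(x), c("ci_lower", "ci_upper")))) %>%
  ungroup() %>%
  # spread the named vector elements into separate columns
  unnest_wider(ci)
```

This calls the function once per group, at the cost of an intermediate list column and an extra unnesting step.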

In my example, a confidence interval (lower and upper bound) comes out of a function, and I would like to add both bounds directly as new columns of dat, e.g. calling the columns ci_lower and ci_upper.

Is there a straightforward way of doing this with dplyr or do I need to insert the elements as a list column then unnest?

NB Keep in mind that the confidence interval values are a function of a group of simulated values x, grouped by s; the CI values should be constant within a group.

Anil

3 Answers

2

You can do this by having your function (or a wrapper function) return a data.frame. When you call it in mutate, don’t specify a column name (or else you’ll end up with a nested data.frame column). If you want to specify names for the new columns, you can include them as function arguments, as shown below.


library(dplyr)

n <- 1e2; M <- 1e3
variance <- 1

x <- rnorm(n*M, 0, sqrt(variance))  # rnorm() takes the standard deviation, not the variance
s <- rep(1:M, each = n)

dat <- data.frame(s = s, x = x)

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
}

ci_wrapper <- function(x, alpha = 0.05, names_out = c("ci_lower", "ci_upper")) {
  ci <- ci_studclt(x, alpha = alpha)
  out <- data.frame(ci[[1]], ci[[2]])
  names(out) <- names_out
  out
}

# original code was ci_studclt(x, variance)
# but ci_studclt() doesn't take a variance argument, so I omitted
dat %>% 
  group_by(s) %>% 
  mutate(ci_wrapper(x))

output:

# A tibble: 100,000 x 4
# Groups:   s [1,000]
       s       x ci_lower ci_upper
   <int>   <dbl>    <dbl>    <dbl>
 1     1  0.233    -0.223    0.139
 2     1  1.03     -0.223    0.139
 3     1  1.53     -0.223    0.139
 4     1  0.0150   -0.223    0.139
 5     1 -0.211    -0.223    0.139
 6     1 -1.13     -0.223    0.139
 7     1 -1.51     -0.223    0.139
 8     1  0.371    -0.223    0.139
 9     1  1.80     -0.223    0.139
10     1 -0.137    -0.223    0.139
# ... with 99,990 more rows

With specified column names:

dat %>% 
  group_by(s) %>% 
  mutate(ci_wrapper(x, names_out = c("ci.lo", "ci.hi")))

output:

# A tibble: 100,000 x 4
# Groups:   s [1,000]
       s       x  ci.lo ci.hi
   <int>   <dbl>  <dbl> <dbl>
 1     1  0.233  -0.223 0.139
 2     1  1.03   -0.223 0.139
 3     1  1.53   -0.223 0.139
 4     1  0.0150 -0.223 0.139
 5     1 -0.211  -0.223 0.139
 6     1 -1.13   -0.223 0.139
 7     1 -1.51   -0.223 0.139
 8     1  0.371  -0.223 0.139
 9     1  1.80   -0.223 0.139
10     1 -0.137  -0.223 0.139
# ... with 99,990 more rows
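
As a side note to the mutate() calls above, the same kind of data-frame-returning wrapper can be dropped into summarise() when one row per group is enough. This is a minimal sketch, assuming dplyr >= 1.0.0 and a small toy data set (3 groups of 10, arbitrary seed, both made up for illustration):

```r
library(dplyr)

# Toy data: 3 groups of 10 observations (illustrative sizes only)
set.seed(42)
toy <- data.frame(s = rep(1:3, each = 10), x = rnorm(30))

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n - 1) * sqrt(S_n / n)
}

# Wrapper returning a one-row data frame with named bounds
ci_wrapper <- function(x, alpha = 0.05, names_out = c("ci_lower", "ci_upper")) {
  ci <- ci_studclt(x, alpha = alpha)
  out <- data.frame(ci[[1]], ci[[2]])
  names(out) <- names_out
  out
}

# Unnamed data-frame result: its columns become columns of the summary
ci_by_group <- toy %>%
  group_by(s) %>%
  summarise(ci_wrapper(x))
```

Here ci_by_group should have one row per value of s, with the columns s, ci_lower and ci_upper.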
zephryl
  • +1. In a similar vein to @Allan's answer. I'd like to avoid modifying the inputs if possible but still upvoted. – Anil Feb 15 '22 at 14:16
  • @Anil Thanks. Can you clarify what you mean by “modifying the inputs”? – zephryl Feb 15 '22 at 14:19
  • @Anil, if you mean you don’t want to modify the original function, you could instead wrap the original function in a new function that returns its output as a dataframe. I’ve modified my answer to demonstrate this approach. – zephryl Feb 15 '22 at 14:29
  • as you elaborate, I mean that I would like to not have to pass a different data structure into `mutate`. I had in mind something like tuple unpacking in Python, which I guess is tricky since here the vectorization is _along_ the elements of `ci` – Anil Feb 15 '22 at 14:44
  • Got it. Just FYI, I think returning a df / tibble is the “canonical” way of creating multiple columns inside a `dplyr` verb. e.g., see Hadley’s post about this functionality in `summarize`: https://www.tidyverse.org/blog/2020/03/dplyr-1-0-0-summarise/ . – zephryl Feb 15 '22 at 15:01
1

If you get your function to return a two-column data frame with repeated values of the same length as the input, then this becomes very easy:

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  res <- mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n-1)*sqrt(S_n / n)
  data.frame(lower = rep(res[1], length(x)), upper = res[2])
}

dat %>% 
  group_by(s) %>% 
  mutate(ci_studclt(x))

#> # A tibble: 100,000 x 4
#> # Groups:   s [1,000]
#>        s      x  lower upper
#>    <int>  <dbl>  <dbl> <dbl>
#>  1     1 -0.767 -0.147 0.293
#>  2     1 -0.480 -0.147 0.293
#>  3     1 -1.31  -0.147 0.293
#>  4     1  0.219 -0.147 0.293
#>  5     1  0.650 -0.147 0.293
#>  6     1  0.542 -0.147 0.293
#>  7     1 -0.249 -0.147 0.293
#>  8     1  2.22  -0.147 0.293
#>  9     1 -0.239 -0.147 0.293
#> 10     1  0.176 -0.147 0.293
#> # ... with 99,990 more rows

Allan Cameron
  • +1. Super, and also handles the grouping. I consider this a hack (kind of, since you have to modify the inputs) but it addresses the issue perfectly. I'll accept this answer if none comes along that directly makes `dplyr` do the work for you. Cheers – Anil Feb 15 '22 at 13:47
1

Another possible variation, if you don't want to change your ci_studclt function:

dat %>% 
  group_by(s) %>% 
  mutate(
    across(x,
           .fns = list(
             lower = ~ci_studclt(.)[1],
             upper = ~ci_studclt(.)[2]
             )
           )
    )

In this case the output will contain new x_lower and x_upper columns. This variant is also somewhat scalable: if you want to apply your function to another column y as well, you can replace x with c(x, y) and get y_lower and y_upper columns in dat too.
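
A minimal sketch of the c(x, y) case, assuming a toy data set with a made-up second column y (not part of the original dat). Note that each lambda calls ci_studclt once, so the function runs twice per column and group.

```r
library(dplyr)

# Toy data with a hypothetical second column y, for illustration only
set.seed(7)
toy <- data.frame(
  s = rep(1:3, each = 10),
  x = rnorm(30),
  y = rnorm(30, mean = 5)
)

ci_studclt <- function(x, alpha = 0.05) {
  n <- length(x)
  S_n <- var(x)
  mean(x) + qt(c(alpha/2, 1 - alpha/2), df = n - 1) * sqrt(S_n / n)
}

res_xy <- toy %>%
  group_by(s) %>%
  mutate(
    across(c(x, y),
           .fns = list(
             lower = ~ ci_studclt(.x)[1],  # first call: lower bound
             upper = ~ ci_studclt(.x)[2]   # second call: upper bound
           ))
  )

names(res_xy)
```

With the default naming scheme of across(), the new columns are named after the input column and the list entry, i.e. x_lower, x_upper, y_lower and y_upper.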

UPDATE

Actually, everything Allan did in his answer can be done inside the mutate call, without any modification of the original function:

dat %>% 
  group_by(s) %>% 
  mutate(
    t(ci_studclt(x)) %>% 
      as.data.frame() %>% 
      setNames(c('ci_lower','ci_upper'))  # base R; set_names() would need purrr or rlang loaded
    )

We simply transpose the output of ci_studclt(x) so that as.data.frame treats it as a single row, and then give this one-row data frame the right names.
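
To see why the transpose matters, here is a small base-R sketch with made-up interval bounds (the numbers are placeholders, not output of ci_studclt): t() turns a plain length-2 vector into a 1 x 2 matrix, which as.data.frame() converts into one row with two columns.

```r
# Placeholder bounds, for illustration only
ci <- c(-0.2, 0.3)

# t() on a plain vector gives a matrix with 1 row and 2 columns
m <- t(ci)
dim(m)

# as.data.frame() keeps that shape: one row, two columns (default names V1, V2)
df <- as.data.frame(m)
default_names <- names(df)

# Give the one-row data frame meaningful column names
df <- setNames(df, c("ci_lower", "ci_upper"))
df
```

Without the transpose, as.data.frame(ci) would produce a single column with two rows, which mutate could not recycle to the group size.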

  • +1 for use of `across` which I wasn't aware of. Thanks. This contains two function calls, which I specified against in the question (I would like to be able to use this for expensive functions without extra overhead). – Anil Feb 15 '22 at 14:42
  • Modified a little @Allan's answer to fit your requirements – Kirill Shilov Feb 15 '22 at 18:04