4

I have a data frame with 5 columns and I want to produce 4 additional columns giving my the difference between the last 4 columns and the first column. I tried the following, but that doesn't work:

library(tidyverse)
df <- as.tibble(data.frame(A = c(1,2), B = c(3,4), C = c(4,5), D = c(2,3), E = c(4,5)))
r_diff <- function(x,y){
  z = y - x
  return(z)
}
vars_to_process <- c("B","C","D","E")
df %>% mutate_at(.cols=vars_to_process, .funs =r_diff(.,df[,1])) %>% head()

Thanks Renger

arnyeinstein
  • 669
  • 1
  • 5
  • 14
  • Your `r_diff` function is the (with inverted arguments) `"-"` function, that you can use this way: `"-"(5, 4)` – jlesuffleur Aug 11 '17 at 11:57
  • 3
    If you convert the thing to a matrix you can subtract column 1 from the rest with a "-". Probably 10x or 100x faster. – Spacedman Aug 11 '17 at 12:11

3 Answers3

9

Here's the simplest way to do it.

df %>% 
   mutate_at(.vars = vars(B:E),
             .funs = list(~ . - A))

The .vars argument lets you specify columns in the same way that you would specify columns in select(), provided you put that specification inside the function vars().

The .funs argument accepts an anonymous function defined on the fly inside a call to list(). And you can reference a column in the dataframe (in this case A) when defining this anonymous function (see this Stackoverflow question).

In addition, with the release of dplyr 1.0.0, you can now simply do the following:

df %>%
   mutate(across(B:E, ~ . - A))
bschneidr
  • 6,014
  • 1
  • 37
  • 52
5

Here's a faster solution using base R code. Strategy is convert to a matrix, subtract column one from the required columns, build back into a data frame. Note this only returns the modified columns - if there are columns not in vars_to_process they'll not appear in the output but you didn't have any of those in your test set so I'll assume they don't exist.

So, always write things in functions whenever possible:

bsr = function(df,vars_to_process){
    m = as.matrix(df)
    data.frame(
         A = m[, 1],
             m[, 1] - m[, vars_to_process])}

Make some test data:

> df = data.frame(matrix(runif(5*1000), ncol=5))
> names(df)=LETTERS[1:5]
> dft = as.tibble(df)
> head(dft)
# A tibble: 6 x 5
          A          B         C         D         E
      <dbl>      <dbl>     <dbl>     <dbl>     <dbl>
1 0.2609174 0.07857624 0.2727817 0.8498004 0.3403234
2 0.3644744 0.95810657 0.8183856 0.2958133 0.4752349
3 0.6042914 0.98793218 0.7547003 0.9596591 0.5354045
4 0.4000441 0.61403331 0.9018804 0.3838347 0.3266855
5 0.6767012 0.11984219 0.9181570 0.5988404 0.6058629

Compare with the tidyverse version:

akr = function(df,vars_to_process){
   df %>% mutate_at(vars_to_process, funs(r_diff(.,df[[1]])))
   }

Check bsr and akr agree:

> head(bsr(dft, vars_to_process))
          A          B           C           D           E
1 0.2609174  0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119  0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765  0.06888696
4 0.4000441 -0.2139892 -0.50183635  0.01620939  0.07335861

> head(akr(dft, vars_to_process))
# A tibble: 6 x 5
          A          B           C           D           E
      <dbl>      <dbl>       <dbl>       <dbl>       <dbl>
1 0.2609174  0.1823412 -0.01186432 -0.58888295 -0.07940594
2 0.3644744 -0.5936322 -0.45391119  0.06866108 -0.11076050
3 0.6042914 -0.3836408 -0.15040892 -0.35536765  0.06888696
4 0.4000441 -0.2139892 -0.50183635  0.01620939  0.07335861

okay, except akr returns a tribble but nm. Benchmark:

> microbenchmark(bsr(dft, vars_to_process),akr(dft, vars_to_process))
Unit: microseconds
                      expr      min        lq      mean   median       uq
 bsr(dft, vars_to_process)  362.117  388.7215  488.9309  446.123  521.776
 akr(dft, vars_to_process) 8070.391 8365.4230 9853.5239 8673.692 9335.613

Base R version is 26 times faster. I'd also argue that subtracting a column from another set of columns is tidier than applying a mutator function but as long as you wrap what your doing in a function it doesn't matter how messy the guts are.

bschneidr
  • 6,014
  • 1
  • 37
  • 52
Spacedman
  • 92,590
  • 12
  • 140
  • 224
4

We need to subset the column with [[ as the [ is still a data.frame

df %>% 
   mutate_at(vars_to_process, funs(r_diff(.,df[[1]]))) 
# A tibble: 2 x 5
#     A     B     C     D     E
#  <dbl> <dbl> <dbl> <dbl> <dbl>
#1     1    -2    -3    -1    -3
#2     2    -2    -3    -1    -3
akrun
  • 874,273
  • 37
  • 540
  • 662