I'm trying to create new variables from a range of variables by applying the base function diff. There are many columns so I don't want to write a new line for each one.

For example, this gives the expected result for the column log_q1:

diff(sample_data$log_q1)

How should I change the code below so that it calculates the successive differences for each column?

sample_data %>%
  mutate_at(.vars = vars(log_q1:log_p4_P),
            .funs = diff(.))

My actual data has more columns than sample_data, so I need to apply the function over a specified range of columns.

data:

structure(list(log_q1 = c(4.46451539632553, 4.45672702457338, 
4.46849093210287, 4.40710670038922, 4.47269821145531, 4.51755453794231
), log_q2 = c(3.69203066137953, 3.69209593205576, 3.72460811572925, 
3.68489316075838, 3.68132860558727, 3.65547388070539), log_q3 = c(6.35343817467613, 
6.36680078210151, 6.34452408661417, 6.34673395951371, 6.34503040525476, 
6.38137436664581), log_q4 = c(4.91654824502687, 4.90168449056365, 
4.89057374461466, 4.91902305270721, 4.90895073701152, 4.93154405124844
), log_x = c(5.5456132613939, 5.59270203608838, 5.6459874308467, 
5.78580621981046, 5.91542190802005, 5.98725699391602), log_x_P = c(6.43238312660334, 
6.44392922055857, 6.40689954461054, 6.38460576433867, 6.45141087279131, 
6.49415458117386), log_p1_P = c(1.00009855051644, 1.00823093327985, 
1.00777219169537, 1.02406727993256, 0.987064584131476, 0.970631603489974
), log_p2_P = c(0.288932864453819, 0.271408689217244, 0.1987931956103, 
0.363077211007143, 0.248306892319478, 0.32056800906634), log_p3_P = c(-0.72266804722466, 
-0.709420563794482, -0.753215618865934, -0.787494816591678, -0.667983839554677, 
-0.664285394245111), log_p4_P = c(-0.94581159853887, -0.920729657461689, 
-1.01104472816803, -1.06193166229344, -0.891127390868887, -0.802435732725928
)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
), na.action = structure(40:46, .Names = c("40", "41", "42", 
"43", "44", "45", "46"), class = "omit"))
gm007

2 Answers

diff returns a vector one element shorter than its input, so the result has to be padded with one extra value to match the length of the column. For example:

> length(df$log_q1)
[1] 6
> length(diff(df$log_q1))
[1] 5

Prepending a zero makes the lengths equal:

df %>% mutate(log_q1_diff_test = c(0,diff(log_q1)))

To run the whole mutate_at set:

df %>% 
    mutate_at(.vars = vars(log_q1:log_p4_P), 
              .funs = list(~ c(0,diff(.))))
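
Since the question asks for new variables, here is a minimal sketch that keeps the original columns and writes the differences to new ones with `across()` (dplyr 1.0.0 or later). The tibble below is a small made-up stand-in for the real data, the `_diff` suffix is an arbitrary naming choice, and padding with NA rather than 0 is also just one option.

```r
library(dplyr)

# Small made-up stand-in for the real data (values are illustrative only)
df <- tibble(
  log_q1 = c(4.0, 4.5, 4.25),
  log_q2 = c(3.0, 3.5, 3.0)
)

# Pad diff() with a leading NA so each result is as long as its column;
# .names writes every result to a new "<column>_diff" column
out <- df %>%
  mutate(across(log_q1:log_q2, ~ c(NA, diff(.x)), .names = "{.col}_diff"))

out
```

The original columns are left untouched, and each differenced column starts with NA because the first row has no predecessor.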
M.Viking

As @M.Viking already explained, diff returns a vector one element shorter than its input, so you need to pad the output with 0 or NA. Alternatively, you can use the lag function to subtract the previous value from each element.

library(dplyr)
sample_data %>% mutate(across(log_q1:log_p4_P, ~. - lag(.)))

By default, lag returns NA for the first value. If you want the first difference to be 0 instead, set the default to the column's first value:

sample_data %>% mutate(across(log_q1:log_p4_P, ~. - lag(., default = first(.))))
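
To see why `default = first(.)` is used rather than `default = 0`, here is a small sketch on a made-up vector (the values and variable names are purely illustrative). It compares the zero-padded `diff` with the two `lag` variants.

```r
library(dplyr)

x <- c(4.0, 4.5, 4.25)  # made-up values for illustration

padded_diff <- c(0, diff(x))                   # zero-padded successive differences
lag_first   <- x - lag(x, default = first(x))  # first element is x[1] - x[1]
lag_zero    <- x - lag(x, default = 0)         # first element is x[1] - 0

all.equal(padded_diff, lag_first)  # the two approaches agree
lag_zero[1]                        # the first value of x itself, not a difference
```

With `default = 0` the first element is the original value rather than a difference, which is why the default is set to the first value of the column.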
Ronak Shah
  • Yes, both `mutate(across(` and `. - lag(.)` are better than `mutate_at` and `diff`! It is tricky how `default=0` does not give "correct" answer, good solution using `first(.)` instead. – M.Viking Mar 05 '21 at 03:04