1

I am trying to calculate the absolute difference between lagged values over several columns. The first row of the resulting data set is NA, which is correct because there is no previous value to calculate the lag. What I don't understand is why the lag isn't calculated for the last value. Note that the last value in the example below (temp) is the lag between the 2nd to last and the 3rd to last values, the lag value between the last and 2nd to last value is missing.

library(tidyverse)
library(purrr)
dim(mtcars) # 32 rows
temp <- map_df(mtcars, ~ abs(diff(lag(.x))))
names(temp) <- paste(names(temp), '.abs.diff.lag', sep= '') 
dim(temp) # 31 rows

It would be an awesome bonus if someone could show me how to pipe the renaming step, I played around with paste and enquo. The real dataset is too long to do a gather/newcolumnname/spread approach.

Thanks in advance!

EDIT: libraries need to run the script added

RoseS
  • 185
  • 2
  • 12
  • Using `lag` and `diff` together is probably not doing what you think it is. `lag(x, lag=1)` returns x[n-1], `diff(x)` returns `x[n] - x[n-1]`. – Mako212 Oct 02 '17 at 16:30

3 Answers3

3

I think the lag call in your existing code is unnecessary as diff calculates the lagged difference automatically (although perhaps I don't understand properly what you are trying to do). You can also use rename_all to add a suffix to all the variable names.

library(purrr)
library(dplyr)
mtcars %>%
  map_df(~ abs(diff(.x))) %>%
  rename_all(funs(paste0(., ".abs.diff.lag")))
#> # A tibble: 31 x 11
#>    mpg.abs.diff.lag cyl.abs.diff.lag disp.abs.diff.lag hp.abs.diff.lag
#>               <dbl>            <dbl>             <dbl>           <dbl>
#>  1              0.0                0               0.0               0
#>  2              1.8                2              52.0              17
#>  3              1.4                2             150.0              17
#>  4              2.7                2             102.0              65
#>  5              0.6                2             135.0              70
#>  6              3.8                2             135.0             140
#>  7             10.1                4             213.3             183
#>  8              1.6                0               5.9              33
#>  9              3.6                2              26.8              28
#> 10              1.4                0               0.0               0
#> # ... with 21 more rows, and 7 more variables: drat.abs.diff.lag <dbl>,
#> #   wt.abs.diff.lag <dbl>, qsec.abs.diff.lag <dbl>, vs.abs.diff.lag <dbl>,
#> #   am.abs.diff.lag <dbl>, gear.abs.diff.lag <dbl>,
#> #   carb.abs.diff.lag <dbl>
markdly
  • 4,394
  • 2
  • 19
  • 27
  • I used lag from the dplyr library to get the first row to fill in as NA in hopes it would keep the number of rows the same as the original dataset (for merging purposes)- but it's missing the last row. Without lag, it acts the same as the stats::lag and leaves off the first row of NAs. Thank you for the rename_all tip, works well! – RoseS Oct 02 '17 at 16:52
  • @RoseS, you could do this `map_df(~ abs(c(NA, diff(.x))))` to return the same number of rows as the original with the first row being `NA`. – markdly Oct 02 '17 at 16:58
  • oh yeah! That's perfect. Thank you so much! – RoseS Oct 02 '17 at 17:03
0

Maybe something like this:

dataCars <- mtcars%>%mutate(diffMPG = abs(mpg - lag(mpg)), 
                        diffHP = abs(hp - lag(hp)))

And then do this for all the columns you are interested in

DataTx
  • 1,839
  • 3
  • 26
  • 49
  • Thank you for the suggestion, with tons of columns I'm hoping to stick with map_df or sapply or something similar. Thank you! – RoseS Oct 02 '17 at 16:46
0

I was not able to reproduce your issues regarding the lag function. When I am executing your sample code, I retrieve a data frame consisting of 31 row, exactly as you mentioned, but the first row is not NA, it is already the subtraction of the 1st and 2nd row.

Regarding your bonus question, the answer is provided here:

temp <- map_df(mtcars, ~ abs(diff(lag(.x)))) %>% setNames(paste0(names(.), '.abs.diff.lag'))

This should result in the desired column naming.

Martin
  • 46
  • 4
  • Thanks so much, Martin! I had played around with setNames and didn't come up with that combo. It is much appreciated. I've no idea why we are getting different results, I even restarted R, so only tidyverse and purrr were attached and I still get the same results I described above – RoseS Oct 02 '17 at 16:37
  • It looks like stats::lag give the results you describe, dplyr::lag gives the results I described. I'd like to get a combo of both results so I can merge the new variables into my original dataset. Thank you! And please let me know if you've any further thoughts :) – RoseS Oct 02 '17 at 16:45