1

I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.

I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.

I found very helpful posts with data.table solutions for a single column Iterate over a column ignoring but retaining NA values in R, and for multiple column operations (e.g. Summarizing multiple columns with dplyr?).

However, I haven't managed to combine the approaches suggested in SO (i.e. apply diff over multiple columns and ignore the NAs)

Here's an example df for illustration and a solution I tried:

library(data.table)

df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))

that's how it works for a single column

 diff_x <- df[!is.na(x), lag_diff := x - shift(x)]  # actually what I want, but for more columns at once

and that's how I apply a diff function over several columns with lapply

diff_all <- setDT(df)[,lapply(.SD, diff)]  # not exactly what I want because NAs are not ignored and  the difference between numeric values is not calculated 

I'd appreciate any suggestion (base, data.table, dplyr ,... solutions) on how to implement a valid !is.na or similar statement into this second line of code very much.

zx8754
  • 52,746
  • 12
  • 114
  • 209
Katrin
  • 19
  • 4

2 Answers2

1

Defining a helper function makes things a bit cleaner:

 lag_diff <- function(x) {
   which_nna <- which(!is.na(x))
   out <- rep(NA_integer_, length(x))
   out[which_nna] <- x[which_nna] - shift(x[which_nna])
   out
 }

cols <- c("x", "y", "z")
setDT(df)
df[, paste0("lag_diff_", cols) := lapply(.SD, lag_diff), .SDcols = cols]

Result:

#     x  y  z lag_diff_x lag_diff_y lag_diff_z
# 1:  1 NA  6         NA         NA         NA
# 2:  2  4  2          1         NA         -4
# 3:  3  5  7          1          1          5
# 4: NA  6 14         NA          1          7
# 5: NA NA 20         NA         NA          6
# 6:  9 15 NA          6          9         NA
# 7:  8 14 NA         -1         -1         NA
# 8:  7 13  2         -1         -1        -18
s_baldur
  • 29,441
  • 4
  • 36
  • 69
0

So you are looking for:

library("data.table")

df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df)
# diff_x <- df[!is.na(x), lag_diff := x - shift(x)]  # actually what I want, but

lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
df[, lapply(.SD, lag_d)]

or

library("data.table")

df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
as.data.frame(lapply(df, lag_d))
jogo
  • 12,469
  • 11
  • 37
  • 42
  • Many thanks to both of of you, jogo and @sindri_baldur, for your quick reply and very useful solutions to my problem! – Katrin Nov 11 '19 at 09:12