In a data.frame, I would like to create a function that does the following for each row:

  1. Retains NA values prior to the first non-NA
  2. After the first non-NA value, fills NAs forward with the closest previous non-NA value
  3. Replaces all of the original non-NA values with NAs

I realize that step 2 can be accomplished with the na.locf() function in the 'zoo' package, but I'm unsure how to write a function that can "recall" which values were originally non-NA, so that I can replace them with NAs in the last step. Similarly, identifying the first or last non-NA value within each row is straightforward, but it's the middle values that have me at a loss. Here's an example with code:

#Example input
dm <- data.frame(rbind(c(NA,1,NA,NA,2,NA,NA,3),
                       c(1,1,NA,2,NA,3,3,3),
                       c(NA,NA,5,NA,NA,NA,6,NA)))
#Desired output
dm2 <- data.frame(rbind(c(NA,NA,1,1,NA,2,2,NA),
                        c(NA,NA,1,NA,2,NA,NA,NA),
                        c(NA,NA,NA,5,5,5,NA,6)))
> dm
  X1 X2 X3 X4 X5 X6 X7 X8
1 NA  1 NA NA  2 NA NA  3
2  1  1 NA  2 NA  3  3  3
3 NA NA  5 NA NA NA  6 NA

> dm2
  X1 X2 X3 X4 X5 X6 X7 X8
1 NA NA  1  1 NA  2  2 NA
2 NA NA  1 NA  2 NA NA NA
3 NA NA NA  5  5  5 NA  6 

A little more about my data: it's composed of integers or NA values, as shown. Within each row, the numeric values will either stay the same, increase, or be NA, but never decrease. The number of non-NA values could theoretically vary from 1 to ncol.

I realize this is a rather specific question; any suggestions or help are much appreciated!

Abby
  • Feels hacky, but `dm2 <- data.frame(t(tidyr::fill(data.frame(t(dm)), X1:X3))); dm2[!is.na(dm)] <- NA` or with zoo, `dm2 <- data.frame(t(apply(dm, 1, zoo::na.locf, na.rm = F))); dm2[!is.na(dm)] <- NA` – alistaire Jan 14 '17 at 00:18
  • You seem to be using *rows* in the data frame as vectors - this is generally bad, data frames are built to work with *columns* as vectors. (Hence the need to `t()` transpose your data in alistaire's comment.) – Gregor Thomas Jan 14 '17 at 00:26
  • @alistaire doesn't feel hacky at all. You're doing a two-step process in two steps. You should convert your comment to an answer. – Gregor Thomas Jan 14 '17 at 00:28

1 Answer

Since you're iterating over rows rather than columns (a sign you should probably transpose or reshape your data), it takes more effort than usual to pass the correct data.frame or vector into tidyr::fill or zoo::na.locf, either of which will fill each NA with the closest preceding non-NA value. Once that's done, you can simply assign NA to the elements of the new data.frame selected by a Boolean mask of the values in the original that are not NA.

tidyr::fill requires a data.frame and only fills down columns, so you'll need to transpose your data.frame to use it. t will transpose, but it also converts the data to a matrix, so data.frame(t(...)) is necessary both to transpose and, later, to transpose back to the original form. X1:X3 specifies which of the new columns to fill; you could equally use dplyr::everything() here if you're not sure what the transposed columns will be called, or even seq(nrow(dm)).

dm2 <- data.frame(t(tidyr::fill(data.frame(t(dm)), X1:X3)))
dm2[!is.na(dm)] <- NA

dm2
##    X1 X2 X3 X4 X5 X6 X7 X8
## X1 NA NA  1  1 NA  2  2 NA
## X2 NA NA  1 NA  2 NA NA NA
## X3 NA NA NA  5  5  5 NA  6

With zoo::na.locf, you could use its data.frame method similarly:

dm2 <- data.frame(t(zoo::na.locf(data.frame(t(dm)), na.rm = FALSE)))
dm2[!is.na(dm)] <- NA

or use its vector method with apply:

dm2 <- data.frame(t(apply(dm, 1, zoo::na.locf, na.rm = FALSE)))
dm2[!is.na(dm)] <- NA

Note that you'll need to set na.locf's na.rm parameter to FALSE so as not to lose the leading NAs. All approaches return the same result.
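If you'd rather not depend on a package, the same two steps (fill forward, then mask the originals) can be sketched in base R. This is only a sketch, assuming every column is numeric; fill_forward() and lag_fill_rows() are made-up helper names for illustration, not functions from zoo or tidyr.

```r
# Example input from the question
dm <- data.frame(rbind(c(NA, 1, NA, NA, 2, NA, NA, 3),
                       c(1, 1, NA, 2, NA, 3, 3, 3),
                       c(NA, NA, 5, NA, NA, NA, 6, NA)))

# Hypothetical helper: carry the last non-NA value forward in a vector,
# leaving any leading NAs untouched
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))       # 0 before the first non-NA, then 1, 2, ...
  c(NA, x[!is.na(x)])[idx + 1]   # position 1 holds the NA used for leading gaps
}

# Hypothetical helper: fill each row forward, then blank out every cell
# that was non-NA in the input (assumes an all-numeric data.frame)
lag_fill_rows <- function(df) {
  m <- as.matrix(df)
  filled <- m
  for (i in seq_len(nrow(m))) {
    filled[i, ] <- fill_forward(m[i, ])
  }
  filled[!is.na(m)] <- NA        # mask the originally observed values
  as.data.frame(filled)
}

dm2 <- lag_fill_rows(dm)
dm2
```

Working on a matrix avoids the double transpose, at the cost of coercing everything to a single type, which is why this assumes numeric-only data.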

Also note it's actually safer (but less readable, for me) to use is.na<- for the second line:

is.na(dm2) <- !is.na(dm)

Here it works identically.
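If you want to convince yourself that the two forms agree, a quick check along these lines should do it. The orig and filled data.frames below are toy stand-ins with made-up values, not the data from the question.

```r
# Toy stand-ins (hypothetical values): `orig` plays the role of dm,
# `filled` the role of its forward-filled copy
orig   <- data.frame(a = c(NA, 1, NA), b = c(2, NA, NA))
filled <- data.frame(a = c(NA, 1, 1),  b = c(2, 2, 2))

# Form 1: subassignment with a logical mask
by_subassign <- filled
by_subassign[!is.na(orig)] <- NA

# Form 2: the is.na<- replacement function
by_is_na <- filled
is.na(by_is_na) <- !is.na(orig)

# Compare the two results
identical(by_subassign, by_is_na)
```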

alistaire