
I am trying to write a function in R that replaces the missing values of selected variables in a data frame with their lagged values (I am using a lag of one observation). I have successfully written the following for loop to do this:

testdata <- data.frame(x1 = c(1:10), 
                       x2 = c(4, 3, NA, 7, 8, NA, 9, NA, 10, 11), 
                       x3 = c(4, 3, NA, 7, 8, NA, 9, NA, NA, 11),
                       x4 = c("a", NA, NA, "d", "e", NA, "f", NA, "g", NA))

for (j in 2:4){
  for (i in 1:10){
    if(is.na(testdata[i, j])){
      testdata[i, j] <- testdata[i - 1, j]
    }}}

The for loop works fine; however, when I generalize this code and wrap it in a function, the function returns an empty object. The function that I have written is as follows:

fill_null <- function(df, columns, rows){
  for (j in columns){
    for(i in rows){
      if(is.na(df[i, j])){
        df[i,j] <- df[i - 1, j]
      } else{
        df[i, j] <- df[i, j]
      }}}}

When I run this function using the following code:

newdf <- fill_null(testdata, 2:4, 1:10)
str(newdf)

I get the following output:

> str(newdf)
 NULL

I am wondering why this for loop will work when it is not called in a function but stops working once it is written into a function. I am also wondering if there is an easy way to fix this issue because I have to fill NA with lagged values for several different data frames.

marc_s
benalbert342
  • Put `return(df)` at the end of your function. – Gregor Thomas Oct 16 '19 at 20:01
  • You may also be interested in the functions `zoo::na.locf` (for a single column) or `tidyr::fill` (for a whole data frame), which do this with more features, and more efficiently. See, for example, [this FAQ on the subject](https://stackoverflow.com/q/7735647/903061). Your function does the same thing as `tidyr::fill(testdata, x2:x4)` – Gregor Thomas Oct 16 '19 at 20:02
  • Stability is a big advantage of using well tested functions. For example, I think your function will throw an out-of-bounds error if there are `NA` values in the first row of the data. – Gregor Thomas Oct 16 '19 at 20:07
  • I'd also add that R has a few good ways to apply a function to certain columns of a data frame. Rather than hardcoding that in your function, I'd suggest writing a simpler function that works on a single vector. Such a function is more flexible than what you have, and you can apply it to columns using the usual R methods like `for` or `lapply`, e.g., `testdata[columns] = lapply(testdata[columns], simple_fill_null)`. Or if you really want the column interface, write a wrapper that does that `lapply`. Keeping your functions small and modular makes them easier to debug and more flexible to use. – Gregor Thomas Nov 13 '19 at 22:10
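
The single-vector approach suggested in the last comment could look roughly like the sketch below. The name `simple_fill_null` comes from the comment, but its body is only an assumption about what was meant; it also skips the first element, which has no previous value to copy.

```r
# Hypothetical body for the simple_fill_null() named in the comment above:
# carry the previous value forward into each NA of a single vector.
simple_fill_null <- function(x) {
  # Start at the second element; the first has nothing before it.
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x  # return the filled vector
}

testdata <- data.frame(
  x1 = 1:10,
  x2 = c(4, 3, NA, 7, 8, NA, 9, NA, 10, 11),
  x3 = c(4, 3, NA, 7, 8, NA, 9, NA, NA, 11),
  x4 = c("a", NA, NA, "d", "e", NA, "f", NA, "g", NA),
  stringsAsFactors = FALSE
)

# Apply the vector function to the chosen columns only.
columns <- 2:4
testdata[columns] <- lapply(testdata[columns], simple_fill_null)
```

Because the function only knows about vectors, the same code can be reused on any data frame by changing `columns`.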

1 Answer


In R, a function returns (unless told otherwise) the value of the last expression it evaluates. In your function, you may think that the last value is `df`; however, it is actually the `for` loop. As per the documentation in `?"for"`, `for` loops return `NULL` (invisibly) as their value. An easy way to demonstrate this is `test <- for(x in 1:3){x}; test`, which returns `NULL`.

To fix this, you can end your function with either `return(df)` or simply `df`.
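
As a minimal sketch of that fix, here is the function from the question with `df` as its final expression. The `i > 1` guard is an addition that is not in the original code; it is assumed here so that an `NA` in the first row is left alone instead of being looked up in a nonexistent row 0. The redundant `else` branch is dropped.

```r
# The asker's fill_null(), changed so that it returns the modified data frame.
fill_null <- function(df, columns, rows) {
  for (j in columns) {
    for (i in rows) {
      # Assumed guard: row 1 has no previous row to copy from.
      if (i > 1 && is.na(df[i, j])) {
        df[i, j] <- df[i - 1, j]
      }
    }
  }
  df  # last evaluated expression, so this is what the function returns
}

testdata <- data.frame(
  x1 = 1:10,
  x2 = c(4, 3, NA, 7, 8, NA, 9, NA, 10, 11),
  x3 = c(4, 3, NA, 7, 8, NA, 9, NA, NA, 11),
  x4 = c("a", NA, NA, "d", "e", NA, "f", NA, "g", NA),
  stringsAsFactors = FALSE
)

newdf <- fill_null(testdata, 2:4, 1:10)
str(newdf)
```

With this change `str(newdf)` describes a data frame rather than `NULL`, and `testdata` itself is left unmodified because the function works on its own copy.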

To address the heart of your issue, however, the dplyr package has a `lag` function which you may find helpful, for example `testdata$j <- ifelse(is.na(testdata$j), dplyr::lag(testdata$j), testdata$j)`, where `j` stands for a column name.
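
One caveat worth checking before relying on the `lag` approach: it looks back exactly one position in the original vector, so a run of consecutive `NA` values is only partly filled, unlike the loop in the question, which reads values it has already filled. A small sketch, assuming dplyr is installed and using the `x3` values from the question:

```r
# x3 from the question: positions 8 and 9 are consecutive NAs.
x3 <- c(4, 3, NA, 7, 8, NA, 9, NA, NA, 11)

# Replace each NA with the value one position earlier in the ORIGINAL vector.
one_step <- ifelse(is.na(x3), dplyr::lag(x3), x3)

# Position 8 is filled from position 7, but position 9 looks back at
# position 8, which was NA in the original, so it stays NA.
one_step
```

If runs of missing values should be filled completely, a last-observation-carried-forward function such as the ones mentioned in the comments is probably the better fit.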

Daniel V