2

Reproducible dataset:

library(data.table)
library(dplyr)
library(zoo)

df = expand.grid(ID = sample(LETTERS[1:5]),
                 Date = seq.Date(as.Date("2012-01-01"), as.Date("2012-12-01"), by = "1 month"))
df = df[order(as.character(df$ID)),]
df = data.table(df, V1 = runif(nrow(df),0,1), V2 = runif(nrow(df),0,1), V3 = runif(nrow(df),0,1))

ind = sample(nrow(df), nrow(df)*.5)
na.gen <- function(x, ind){x[ind] <- NA}
df1 <- df %>% slice(., ind) %>% mutate_each(funs(na.gen), starts_with("V"))
df2 = df[!ind]
df <- rbind(df1, df2)
df <- df[order(as.character(df$ID), df$Date),]
df$ID = as.character(df$ID)

In the above dataset, my idea was to impute data using Last Observation Carried Forward method. My original problem is a very large dataset, so I tested dplyr and data.table solutions.

final_dplyr <- df %>% group_by(ID) %>% mutate_each(funs(na.locf), starts_with("V"))
final_data.table <- df[, na.locf(.SD), by = ID]

data.table gives me the right solution, however, dplyr messes the subset which begins from NA. I get the following warning using dplyr:

Warning messages:
1: In `[.data.table`(`_dt`, , `:=`(V1, na.locf(V1)), by = `_vars`) :
  Supplied 11 items to be assigned to group 1 of size 12 in column 'V1' (recycled leaving remainder of 1 items).

Can somone help me understand what I am doing wrong with dplyr?

Divi
  • 1,614
  • 13
  • 23
  • 1
    You're writing these very differently. na.locf(.SD) is operating on the data set, while the mutate_each operates on each column independently... – Frank Jun 16 '16 at 20:13
  • But, from what I understand so far, `na.locf` should perform the same imputation on a column vector or a data frame. I had initially written the `dplyr` version for my problem, but when it did not work I had to write the `data.table` version, which to my surprise works. Now I am not clear, why one works, but the other does not. – Divi Jun 16 '16 at 20:35

1 Answers1

8

Okay, a lot of things going on here. First as @Frank noted, the two commands operate on different objects. na.locf(.SD) on the subset-data.table for each ID, where as dplyr's on each column separately for each ID.

To identify where the issue is, I'll use data.table equivalent of your dplyr syntax.

df[, lapply(.SD, na.locf), by=ID]
# warning

We get the same warning message. Seems like the number of rows returned for each column aren't identical for 1 or more groups. Let's check that.

df[, lapply(.SD, function(x) length(na.locf(x))), by=ID]
#    ID Date V1 V2 V3
# 1:  A   12 12 12 12
# 2:  B   12 12 12 12
# 3:  C   12 11 11 11 # <~~~ we've a winner!
# 4:  D   12 12 12 12
# 5:  E   12 12 12 12

Why is this happening?

head(df[ID == "C"])
#    ID       Date        V1        V2        V3
# 1:  C 2012-01-01        NA        NA        NA
# 2:  C 2012-02-01 0.7475075 0.8917311 0.7601174
# 3:  C 2012-03-01 0.4922747 0.7749479 0.3995417
# 4:  C 2012-04-01 0.9013631 0.3388313 0.8873779
# 5:  C 2012-05-01        NA        NA        NA
# 6:  C 2012-06-01        NA        NA        NA

nrow(df[ID == "C", na.locf(.SD), .SDcols= -c("ID")])
# 12 as expected

nrow(df[ID == "C", lapply(.SD, na.locf), .SDcols= -c("ID")])
# 12, but with warnings

Using na.locf() on columns separately returns 11 for V1:V4. Why? It seems like it's because of the NA at the beginning. ?na.locf has a na.rm argument which by default is set to TRUE which removes NAs from the beginning. So let's set it to false and try again

nrow(df[ID == "C", lapply(.SD, na.locf, na.rm=FALSE), .SDcols = -c("ID")])
# 12, no warnings

It worked with na.locf(.SD) because it also ran na.locf on Date column which returned 12 rows, I think.

In essence, you need to set na.rm=FALSE in dplyr somehow, or get dplyr to work on the entire object somehow. I've no idea how to do either.

PS: Note that you can use := to update the data.table by reference instead of returning a new object with data.table syntax.

Arun
  • 116,683
  • 26
  • 284
  • 387
  • Regarding what to do in dplyr, this seems to work `df %>% group_by(ID) %>% mutate_each(funs(na.locf(., na.rm=FALSE)), starts_with("V"))` – Frank Jun 16 '16 at 20:40
  • Another thing that seems to work: `df %>% group_by(ID) %>% do(na.locf(. %>% select(starts_with("V"))))` (Not saying these should be added to the answer because I'm not really sure they work... oh, seems I lose the Date column here or something.) – Frank Jun 16 '16 at 20:42
  • Thanks a lot, I was only wonderig whether it was a `dplyr` issue. @Frank, yes both of them work. – Divi Jun 16 '16 at 20:44
  • `na.omit()` - http://stackoverflow.com/questions/26665319/removing-na-in-dplyr-pipe – leerssej Jun 16 '16 at 20:53