1

I'm trying to add missing rows to the dataframe (within each value of NO_REF) and at the same time do linear interpolation on some columns and insert last non NA values on others. I can't figure out how to prevent inserting missing dates when DATE_X value after the gap is greater than the last DATE value before the gap.

Here's my dataframe:

df = data.frame(DATE = as.Date(c("2016-01-31","2016-03-31","2016-05-31","2016-08-31","2016-12-31","2016-02-29","2016-04-30","2016-06-30","2016-08-31","2016-10-31","2016-12-31","2015-01-31","2015-02-28","2015-06-30","2015-10-31","2015-12-31")), 
            DATE_X = as.Date(c("2010-01-31","2010-01-31","2016-04-30","2015-03-31","2015-03-31","2010-10-31","2010-10-31","2016-05-31","2016-05-31","2015-07-31","2015-07-31","2013-01-31","2013-01-31","2013-01-31","2015-09-30","2015-09-30")),
            NO_REF = c("P1","P1","P1","P2","P2","O1","O1","O1","O1","R1","R2","Q1","Q1","Q1","Q1","Q1"),
            KAP = as.double(15:30),
            DIV =c("PI","PI","PI","PI","PI","OP","OP","OP","OP","PR","PR","OP","OP","OP","OP","OP"))

and here's my code:

library(dplyr)
library(multidplyr)
library(zoo)

cluster <- create_cluster(3)
cluster_eval(cluster,library(dplyr))
cluster_eval(cluster,library(zoo))

result = df %>% partition(NO_REF,cluster=cluster) %>%
group_by(NO_REF) %>%
do(left_join(data.frame(NO_REF = .$NO_REF[1], DATE = seq(min(.$DATE)+1, max(.$DATE)+1, by="1 month")-1), ., 
           by=c("NO_REF","DATE"))) %>%  mutate(DATE_X=na.locf(DATE_X, na.rm=FALSE),
             DIV=na.locf(DIV, na.rm=FALSE), KAP=na.approx(KAP)) %>% collect()

In the following table blue rows should not be in the final result.

expected result:

enter image description here

Thank you for any help in advance!

theoretisch
  • 1,718
  • 5
  • 24
  • 34
Koles
  • 47
  • 5
  • I think you should use bind_rows http://stackoverflow.com/questions/41763568/summarise-for-multiple-group-by-variables-combined-and-individually#comment70720398_41763568 – jarry jafery Jan 23 '17 at 16:38

1 Answers1

1

This might not be the most efficient way, but I think it does what you want:

library(dplyr)
library(multidplyr)
library(zoo)

cluster <- create_cluster(3)
cluster_eval(cluster,library(dplyr))
cluster_eval(cluster,library(zoo))

result = df %>% partition(NO_REF,cluster=cluster) %>%
    group_by(NO_REF) %>%
    do(left_join(data.frame(NO_REF = .$NO_REF[1], DATE = seq(min(.$DATE)+1, max(.$DATE)+1, by="1 month")-1), ., 
       by=c("NO_REF","DATE"))) %>%  
    filter(!(is.na(DATE_X) & 
             na.locf(DATE_X, fromLast=TRUE, na.rm=FALSE)>
             na.locf(DATE+days(ifelse(is.na(DATE_X), NA, 0)), na.rm=FALSE))) %>% 
    mutate(DATE_X=na.locf(DATE_X, na.rm=FALSE),
           DIV=na.locf(DIV, na.rm=FALSE), 
           KAP=na.approx(KAP)) %>% 
    collect()

In brief, the DATE column is treated as NA and carried forward where DATE_Xis missing, DATE_X is carried backward, and rows where the latter is larger than the former while DATE_X is missing are removed.

mpjdem
  • 1,504
  • 9
  • 14