Efficient Way to Merge Data and Fill NAs with Inconsistent Time Frame

Question

So I have two tables:

Table1: 
ID  Yr     Qty  Cum_Qty
A   2013    3   3
A   2015    3   6
A   2016    2   8
B   2006    1   1  

Table2:
ID  Yr  
A   2013    
A   2014    
A   2015    
A   2016    
A   2017
B   2016    
B   2017    
C   2016
C   2017

This is what I want to achieve - I want to add Cum_Qty to table 2 and inherit most recent Cum_Qty if one exists:

ID  Yr  Cum_Qty
A   2013    3
A   2014    3
A   2015    6
A   2016    8
A   2017    8   
B   2016    1
B   2017    1
C   2016    0
C   2017    0

If I currently do a left_join(table2, table1, by = c("ID", "Yr"), I end up with NA values in Cum_Qty if specific year is missing in table 2. I thought about using lag function but that would leave me with NAs for the first year record and wouldn't work for ID C, which is missing from table1. I would like to use dplyr::mutate so my code stays in the pipes.

I think I know how to do this with 3~4 mutates, but I want to find a way to simplify the code. Does anyone have any recommendations on what I can do?

To create tables in R:

table_1 <- data.frame(
    ID = c("A", "A", "A", "B"), 
    Yr = c(2013, 2015, 2016, 2006),
    Qty = c(3, 3, 2, 1)) %>%
    arrange(ID, Yr) %>%
    group_by(ID) %>%
    mutate(Cum_Qty = cumsum(Qty))

table_2 <- data.frame(
    ID = c("A", "A", "A", "A", "A", "B", "B", "C", "C"),
    Yr = c(2013, 2014, 2015, 2016, 2017, 2016, 2017, 2016, 2017))

score 1 · Answer 1 · answered Dec 15 '17 at 02:04

Given that you want to use some data from table_1 that is not present in table_2, I don't think you can start out with a left_join. You can start with a full_join, then use the na.locf from the zoo package to get the last non-NA value per ID group, before doing an inner_join to only keep rows in table_2 at the end:

library(dplyr)
library(zoo)
full_join(table_2,table_1,by=c("ID","Yr")) %>% 
group_by(ID) %>%
arrange(ID,Yr) %>%
mutate(Cum_Qty = na.locf(Cum_Qty,na.rm=F)) %>%
mutate(Cum_Qty = ifelse(is.na(Cum_Qty),0,Cum_Qty)) %>%
select(-Qty) %>%
inner_join(table_2)

# A tibble: 9 x 3
# Groups:   ID [?]
     ID    Yr Cum_Qty
  <chr> <dbl>   <dbl>
1     A  2013       3
2     A  2014       3
3     A  2015       6
4     A  2016       8
5     A  2017       8
6     B  2016       1
7     B  2017       1
8     C  2016       0
9     C  2017       0

Thanks for your answer! I wasn't aware of zoo::na.locf. My tables are MUCH bigger data sets and I want to do most of the data manipulation using Spark; I'm not sure whether if I use na.locf dplyr/sparklyr would automatically translate the function though :/ — creativename, Dec 15 '17 at 04:27
I'm not familiar with sparklyr, so I don't know how you could go about it. It does seem that sparklyr only supports a limited range of functions (https://stackoverflow.com/questions/39494484/sparkr-vs-sparklyr), although one of the comments mentions a new spark_apply() function in the latest version, you may want to have a look at that. — Lamia, Dec 15 '17 at 23:13

Efficient Way to Merge Data and Fill NAs with Inconsistent Time Frame

1 Answers1