2

This is probably a very easy question, but I cannot seem to figure it out...

I have the following list

l <- list(May=data.frame(date=c(NA, as.Date("2019/5/1"),  NA,  NA,  NA, NA, as.Date("2019/5/2"),  NA, NA, NA, NA, NA, NA, NA), ID = c( "107349", "110024", "6187"  , "100420", "94436",  "88995" , "110165" ,"91644",  "108508", "105213", "108773", "102636" ,"102339" ,"100413")),
        April = data.frame(date=c(as.Date("2019/4/1"), as.Date("2019/4/2"),  NA,  NA,  NA,  NA,  NA, NA, NA, NA,  NA, NA, as.Date("2019/4/3"), NA, as.Date("2019/4/4"),  NA, NA, NA, NA, NA), ID=c("37866",  "107349", "93051",  "6187",   "98274",  "100420", "94436",  "88995"  ,"105107", "105109", "91644",  "105103" ,"108508" ,"105213", "108773", "85409"  ,"104145","102636" ,"102339" ,"100413")),
        March = data.frame(date= c(NA, NA,  NA,  NA,  NA,  NA, NA, NA, NA,  NA, NA, as.Date("2019/3/1"),  NA, NA, NA, NA, NA, NA), ID=c("93051" , "104499" ,"6187",   "98274",  "100420" ,"94436",  "88995"  ,"105107" ,"105109", "91644"  ,"105103", "105213" ,"85409" , "104145", "100989", "102636" ,"102339", "100413")),
        February = data.frame(date= c(NA , NA, as.Date("2019/2/1"),  NA,  NA,  NA,  NA ,as.Date("2019/2/2"), as.Date("2019/2/3"), as.Date("2019/2/4"),  NA, as.Date("2019/2/5"),  NA ,NA, as.Date("2019/2/6"), NA, NA, NA, NA, NA, NA, NA), ID=c("94266" , "93051",  "104499" ,"6187" ,  "98274",  "100420", "94436"  ,"88995",  "105107", "105109", "91644"  ,"105103", "85409"  ,"102252", "104145", "94559",  "101426", "100992" ,"100989" ,"102636" ,"102339" ,"100413")),
        January = data.frame(date = seq(as.Date("2019/1/1"),  by = "day", length.out = 18), ID=c("94266" , "93051",  "99836",  "6187" ,  "98274",  "100420", "94436",  "91644",  "85409",  "102252", "94412",  "94559",  "101426", "100992", "100989", "102636", "102339", "100413")))

I'm trying to match a specific value in one column (Date) and replace it with the corresponding value from the same column if values in the other column (ID) are the same. The date column should be the same across all dataframes if the corresponding ID values in the ID columns match but I've got a date only for the first time the ID appears and NAs for the following appearances of an ID.

I tried using match and subset but I couldn't figure it out.

dyel
  • 69
  • 8
  • Why is your data so inconsistent? date in February data.frame looks like `17928` while date in January data frame looks like `2019-01-01`. Also what's the ` eRec` in February data frame. – Adam Quek Aug 08 '19 at 06:47
  • Also not really understanding what you're trying to achieve here too. Can you provide an example of what is the final output you wish to obtain? – Adam Quek Aug 08 '19 at 06:48
  • @Adam Quek: the inconsistencies in the dates are because certain date columns start with `NA` instead of a valid date, and the rest of the dates in the column are transformed to numeric. – Joris C. Aug 08 '19 at 06:51
  • @JorisChau thanks! I didn't know that CRAN will force a date column into a numeric one automatically. good to know. – Adam Quek Aug 08 '19 at 06:54

2 Answers2

2

First change date columns as Dates instead of numbers

l <- lapply(l, function(x) {x$date <- as.Date(x$date, origin = "1970-01-01");x})

We can then use bind_rows to bind list of dataframes into one, group_by ID, fill the NA dates and split the dataframe back to list of dataframes using group_split.

library(dplyr)

bind_rows(l, .id = "group") %>%
   mutate(group = factor(group, levels = names(l))) %>%
   group_by(ID) %>%
   tidyr::fill(date) %>%
   tidyr::fill(date, .direction = "up") %>%
   ungroup %>%
   group_split(group, keep = FALSE) %>%
   setNames(names(l))

#$May
# A tibble: 14 x 2
#   date       ID    
#   <date>     <chr> 
# 1 2019-04-02 107349
# 2 2019-05-01 110024
# 3 2019-01-04 6187  
# 4 2019-01-06 100420
# 5 2019-01-07 94436 
# 6 2019-02-02 88995 
# 7 2019-05-02 110165
# 8 2019-01-08 91644 
# 9 2019-04-03 108508
#10 2019-03-01 105213
#11 2019-04-04 108773
#12 2019-01-16 102636
#13 2019-01-17 102339
#14 2019-01-18 100413
#...

This is assuming that every ID has atleast one non-NA date in the the entire list. When we group_by ID that non-NA value can be above or below NA values with same ID hence, we need to fill NA values in both the direction (default is "down"). We create "group" column during bind_rows to identify which values came from which list so that we can use it to split it again later.

Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • Could you elaborate on that please? `fill` seems cool, but the double usage seems an even neater trick. Don't have my R session on right now to check but will do – tjebo Aug 08 '19 at 07:03
  • 1
    @Tjebo added some explanation in the answer. – Ronak Shah Aug 08 '19 at 07:11
  • Perhaps I misunderstood the question, but how come the `May` output contains 20 rows, whereas the `May` input contains only 14 rows? – Joris C. Aug 08 '19 at 09:03
  • 1
    @JorisChau no..you did not. I did not check the output correctly. I have updated it now, should be correct. Thanks for noticing :) – Ronak Shah Aug 08 '19 at 09:12
  • Thanks had a look at that. Neat. One could also use [@akrun's nice trick](https://stackoverflow.com/a/25267681/7941188) to arrange NAs at first position – tjebo Aug 08 '19 at 11:26
2

As OP mentioned trying match and subset, here is another approach using subset to create an initial lookup data.frame and filling in missing values with match:

lookup <- do.call("rbind", l)
lookup <- subset(lookup, !is.na(lookup$date))

lapply(l, function(x) { x$date <- lookup$date[match(x$ID, lookup$ID)]; x })
#> $May
#>          date     ID
#> 1  2019-04-02 107349
#> 2  2019-05-01 110024
#> 3  2019-01-04   6187
#> 4  2019-01-06 100420
#> 5  2019-01-07  94436
#> 6  2019-02-02  88995
#> 7  2019-05-02 110165
#> 8  2019-01-08  91644
#> 9  2019-04-03 108508
#> 10 2019-03-01 105213
#> 11 2019-04-04 108773
#> 12 2019-01-16 102636
#> 13 2019-01-17 102339
#> 14 2019-01-18 100413
#> 
#> ...

Data

Note that the data has been modified such that all date columns are of class Date.

l <- list(May = structure(list(date = structure(c(NA, 18017, NA, NA, 
NA, NA, 18018, NA, NA, NA, NA, NA, NA, NA), class = "Date"), 
    ID = c("107349", "110024", "6187", "100420", "94436", "88995", 
    "110165", "91644", "108508", "105213", "108773", "102636", 
    "102339", "100413")), class = "data.frame", row.names = c(NA, 
-14L)), April = structure(list(date = structure(c(17987, 17988, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 17989, NA, 17990, NA, 
NA, NA, NA, NA), class = "Date"), ID = c("37866", "107349", "93051", 
"6187", "98274", "100420", "94436", "88995", "105107", "105109", 
"91644", "105103", "108508", "105213", "108773", "85409", "104145", 
"102636", "102339", "100413")), class = "data.frame", row.names = c(NA, 
-20L)), March = structure(list(date = structure(c(NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 17956, NA, NA, NA, NA, NA, NA
), class = "Date"), ID = c("93051", "104499", "6187", "98274", 
"100420", "94436", "88995", "105107", "105109", "91644", "105103", 
"105213", "85409", "104145", "100989", "102636", "102339", "100413"
)), class = "data.frame", row.names = c(NA, -18L)), February = structure(list(
    date = structure(c(NA, NA, 17928, NA, NA, NA, NA, 17929, 
    17930, 17931, NA, 17932, NA, NA, 17933, NA, NA, NA, NA, NA, 
    NA, NA), class = "Date"), ID = c("94266", "93051", "104499", 
    "6187", "98274", "100420", "94436", "88995", "105107", "105109", 
    "91644", "105103", "85409", "102252", "104145", "94559", 
    "101426", "100992", "100989", "102636", "102339", "100413"
    )), class = "data.frame", row.names = c(NA, -22L)), January = structure(list(
    date = structure(17897:17914, class = "Date"), ID = c("94266", 
    "93051", "99836", "6187", "98274", "100420", "94436", "91644", 
    "85409", "102252", "94412", "94559", "101426", "100992", 
    "100989", "102636", "102339", "100413")), class = "data.frame", row.names = c(NA, 
-18L)))
Joris C.
  • 5,721
  • 3
  • 12
  • 27