r data.table inserting specific rows if missing into a data table

Question

I have a data table in R that I'm currently working with that contains thousands of pids each tracked over a period of time. Within this data table is an agedays column. I need to ensure that each pid has a row corresponding to the following agedays values :

ageday_values<-c(61, 91, 183, 274, 365, 457, 548, 639, 731)

If the pid already has a row corresponding to an agedays value above then that specific row should not be altered in anyway. If the row isn't present, then when the row is inserted, all columns should hold an NA value with exception to the agedays column of that specific row.

I know how to insert rows over a specific range of values, but am unsure how to do so over a vector of specific values. Does anyone have some insight on how to do this?

small subset of data :

DT <- data.table(pid = c(1,1,1,1,2,2,2,2,3,3,3,3),
                 agedays = c(25,50,61,457,37,365,731,50,61,457,639,731),
                 wt = c(1.2,1.5,1.7,2.0,4.5,6.7,6.8,6.7,4.5,6.6,8.9,6.7),
                 ht = c(3.2,1.8,6.7,2.8,3.5,7.7,9.8,1.7,6.9,3.8,0.9,4.7))

what if an `agedays` value is not present in the `ageday_values` vector, should it be kept? — s_baldur, Aug 04 '20 at 08:49

score 1 · Answer 1 · answered Aug 04 '20 at 03:29

This is easy to do using tidyr::complete

tidyr::complete(DT, pid, agedays = ageday_values)

# A tibble: 31 x 4
#     pid agedays   wt    ht
#   <dbl>   <dbl> <dbl> <dbl>
# 1     1      61  1.7   6.7
# 2     1      91  NA    NA  
# 3     1     183  NA    NA  
# 4     1     274  NA    NA  
# 5     1     365  NA    NA  
# 6     1     457   2     2.8
# 7     1     548  NA    NA  
# 8     1     639  NA    NA  
# 9     1     731  NA    NA  
#10     2      61  NA    NA  
# … with 21 more rows

For every pid value we create all values of ageday_values. If the ageday_values is present it will have wt and ht values else they will have NA's.

score 1 · Accepted Answer · answered Aug 04 '20 at 08:40

In one line:

DT[CJ(pid, agedays = ageday_values, unique = TRUE), on = .(pid, agedays)]

 #    pid agedays  wt  ht
 # 1:   1      61 1.7 6.7
 # 2:   1      91  NA  NA
 # 3:   1     183  NA  NA
 # 4:   1     274  NA  NA
 # 5:   1     365  NA  NA
 # 6:   1     457 2.0 2.8
 # ...

akrun · Answer 3 · 2020-08-04T17:35:30.087

0

We could do a cross join with pid and 'ageday_values', and join on the 'pid', 'agedays'

library(data.table)
DT[DT[, CJ(pid, agedays = union(ageday_values),
        unique = TRUE)], on = .(pid, agedays)]

edited Aug 04 '20 at 17:35

answered Aug 03 '20 at 22:38

akrun

874,273
37
540
662

I ran the code above and get the following error : "Error in `[.data.table`(DT, DT[, CJ(pid, agedays = union(agedays, : Column(s) [pid] not found in I" … the variable pid is in my data table. Any idea why I would get this error? – bziggy Aug 03 '20 at 22:51
@boodaloo1 can you show the `packageVersion('data.table')` – akrun Aug 03 '20 at 22:51
@boodaloo1 i used `packageVersion('data.table') [1] ‘1.12.8’` – akrun Aug 03 '20 at 22:52
sure! [1] ‘1.10.4.3’ – bziggy Aug 03 '20 at 22:52
@boodaloo1 may be the `CJ` was implemented after that 1.10 – akrun Aug 03 '20 at 22:52
1

I will try with your version. Thanks! – bziggy Aug 03 '20 at 22:53
I was able to run this with no error, but the code has filled in all ageday values that are missing in between any ageday values (from the min to the max of each pid) as opposed to filling in the rows corresponding only to the ageday values within the vector "ageday_values". Does your output look the same? – bziggy Aug 03 '20 at 23:54

r data.table inserting specific rows if missing into a data table

3 Answers3