data.table outer join based on groups in R

Question

I have a data with the following columns:

CaseID, Time, Value.

The 'time' column values are not at regular intervals of 1. I am trying to add the missing values of time with 'NA' for the rest of the columns except CaseID.

Case Value  Time
1    100    07:52:00
1    110    07:53:00
1    120    07:55:00
2    10     08:35:00
2    11     08:36:00
2    12     08:38:00

Desired output:

Case Value  Time
1    100    07:52:00
1    110    07:53:00
1    NA     07:54:00
1    120    07:55:00
2    10     08:35:00
2    11     08:36:00
2    NA     08:37:00
2    12     08:38:00

I tried dt[CJ(unique(CaseID),seq(min(Time),max(Time),"min"))] but it gives the following error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in 9827315 rows; more than 9620640 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

I cannot able to make it work..any help would be appreciated.

@DavidArenburg I have edited the question with my desired output and the error I am getting — UoU, Dec 18 '14 at 20:54

jlhoward · Answer 1 · 2014-12-18T21:54:58.590

3

Like this??

dt[,Time:=as.POSIXct(Time,format="%H:%M:%S")]
result <- dt[,list(Time=seq(min(Time),max(Time),by="1 min")),by=Case]
setkey(result,Case,Time)
setkey(dt,Case,Time)
result <- dt[result][,Time:=format(Time,"%H:%M:%S")]
result
#    Case Value     Time
# 1:    1   100 07:52:00
# 2:    1   110 07:53:00
# 3:    1    NA 07:54:00
# 4:    1   120 07:55:00
# 5:    2    10 08:35:00
# 6:    2    11 08:36:00
# 7:    2    NA 08:37:00
# 8:    2    12 08:38:00

Another way:

dt[, Time := as.POSIXct(Time, format = "%H:%M:%S")]
setkey(dt, Time)
dt[, .SD[J(seq(min(Time), max(Time), by='1 min'))], by=Case]

We group by Case and join on Time on each group using .SD (hence setting key on Time). From here you can use format() as shown above.

edited Dec 18 '14 at 21:54

answered Dec 18 '14 at 21:15

jlhoward

58,004
7
97
140

This is giving the same error, but now for 1 single group I guess. Error: Join results in 1504 rows; more than 1440 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice. The total size of the dataset is 1000100 rows – UoU Dec 18 '14 at 21:59
@UoU, did you read the error message that says how to fix it? Check [this](http://stackoverflow.com/a/23087759/559784) or [this](http://stackoverflow.com/q/25204859/559784) post for additional explanation. – Arun Dec 18 '14 at 22:02
Have you checked if there multiple records with the same Time for a given Case? – jlhoward Dec 18 '14 at 22:04
I read the message and tried it with allow.cartesian=TRUE..but is still gives the same error – UoU Dec 18 '14 at 22:08
@jlhoward I tried it with 'distinct' rows but it still doesn't work – UoU Dec 18 '14 at 22:09
2

Well, obviously it works on the sample you provided, which does not appear to be representative. So you need yo provide a representative sample, or upload the full dataset somewhere and provide a link in your question. – jlhoward Dec 18 '14 at 22:11
The point is that the times can actually overlap. That's what makes it trickier. Have to reconsider my answer. – Serban Tanasa Dec 18 '14 at 22:46

data.table outer join based on groups in R

1 Answers1