3

I have a data with the following columns:

CaseID, Time, Value.

The 'time' column values are not at regular intervals of 1. I am trying to add the missing values of time with 'NA' for the rest of the columns except CaseID.

Case Value  Time
1    100    07:52:00
1    110    07:53:00
1    120    07:55:00
2    10     08:35:00
2    11     08:36:00
2    12     08:38:00

Desired output:

Case Value  Time
1    100    07:52:00
1    110    07:53:00
1    NA     07:54:00
1    120    07:55:00
2    10     08:35:00
2    11     08:36:00
2    NA     08:37:00
2    12     08:38:00

I tried dt[CJ(unique(CaseID),seq(min(Time),max(Time),"min"))] but it gives the following error:

Error in vecseq(f__, len__, if (allow.cartesian || notjoin) NULL else as.integer(max(nrow(x),  : 
  Join results in 9827315 rows; more than 9620640 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice.

I cannot able to make it work..any help would be appreciated.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
UoU
  • 31
  • 2

1 Answers1

3

Like this??

dt[,Time:=as.POSIXct(Time,format="%H:%M:%S")]
result <- dt[,list(Time=seq(min(Time),max(Time),by="1 min")),by=Case]
setkey(result,Case,Time)
setkey(dt,Case,Time)
result <- dt[result][,Time:=format(Time,"%H:%M:%S")]
result
#    Case Value     Time
# 1:    1   100 07:52:00
# 2:    1   110 07:53:00
# 3:    1    NA 07:54:00
# 4:    1   120 07:55:00
# 5:    2    10 08:35:00
# 6:    2    11 08:36:00
# 7:    2    NA 08:37:00
# 8:    2    12 08:38:00

Another way:

dt[, Time := as.POSIXct(Time, format = "%H:%M:%S")]
setkey(dt, Time)
dt[, .SD[J(seq(min(Time), max(Time), by='1 min'))], by=Case]

We group by Case and join on Time on each group using .SD (hence setting key on Time). From here you can use format() as shown above.

jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • This is giving the same error, but now for 1 single group I guess. Error: Join results in 1504 rows; more than 1440 = max(nrow(x),nrow(i)). Check for duplicate key values in i, each of which join to the same group in x over and over again. If that's ok, try including `j` and dropping `by` (by-without-by) so that j runs for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and datatable-help for advice. The total size of the dataset is 1000100 rows – UoU Dec 18 '14 at 21:59
  • @UoU, did you read the error message that says how to fix it? Check [this](http://stackoverflow.com/a/23087759/559784) or [this](http://stackoverflow.com/q/25204859/559784) post for additional explanation. – Arun Dec 18 '14 at 22:02
  • Have you checked if there multiple records with the same Time for a given Case? – jlhoward Dec 18 '14 at 22:04
  • I read the message and tried it with allow.cartesian=TRUE..but is still gives the same error – UoU Dec 18 '14 at 22:08
  • @jlhoward I tried it with 'distinct' rows but it still doesn't work – UoU Dec 18 '14 at 22:09
  • 2
    Well, obviously it works on the sample you provided, which does not appear to be representative. So you need yo provide a representative sample, or upload the full dataset somewhere and provide a link in your question. – jlhoward Dec 18 '14 at 22:11
  • The point is that the times can actually overlap. That's what makes it trickier. Have to reconsider my answer. – Serban Tanasa Dec 18 '14 at 22:46