Duplicate rows in a data table based on a function in R

Question

I have a data table with a column for the starting_time for an event. I need to duplicate the table for each minute between 7am and the starting_time.

I also need a new column adding for each duplicated data set, if there were 10 minutes between 7am and the starting_time, the data set would be duplicated 10 times. The first set would show 7:01am, the next set 7:02am etc with the final set being the same time as the starting_time.

I am currently doing this with 2 loops but it very slow

dt <- data.table(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 time = c(rep("2020-07-23 12:00:00 BST", 3), rep("2020-07-23 12:10:00 BST", 3), rep("2020-07-23 12:20:00 BST", 3)),
                 h = c("a", "b", "c", "d", "e", "f", "g", "h", "i"))



s <- data.table(NULL)
fs <- data.table(NULL)
ids <- unique(dt$id)

start_time <- strptime(paste0(lubridate::ymd(Sys.Date()), " 11:50:00"), "%Y-%m-%d %H:%M:%S")

for (i in seq_along(ids)) {
  
  d <- dt[id == ids[i]]
  diff_minutes <- as.numeric(difftime(max(d$time), start_time, units='mins'))
  
  for (j in 1:diff_minutes) {
    
    ref_time <- start_time + (j * 60)
    d[, time_stamp := ref_time]
    d[, time_stamp_hour := substr(time_stamp, 12, 19)]
    s <- rbindlist(list(s, d), use.names = T, fill = T)
    
  }
  
  fs <- rbindlist(list(fs, s))
  s <- data.table(NULL)
}

The outer loop goes through each event since they all start at different times so the number of replications will be different. The inner loop is doing the duplicating and adding the new column.

This currently takes around a minute to run on the real data set which isn't a huge problem but it feels very inefficient.

Is there a more effecient way to do this using any of the functions within the data table package or without using loops?

Technically speaking the first data set should be the same as the start_time so a 10 minute difference should result in 11 copies of the table but to keep it simple I just ignored it.

iago · Accepted Answer · 2020-07-23T15:20:46.000

UPDATE

Let's try:

> dt[, min_seq := lapply(as.POSIXct(time), function(end) seq(start_time,end, by = "mins"))]
> res <- dt[ , list( mins = do.call("unlist",  min_seq) ) , by = .(id,time,h) ]
> res[, mins2 := substr(mins, 12, 16)]
> head(res)
   id                    time h                mins mins2
1:  1 2020-07-23 12:00:00 BST a 2020-07-23 11:50:00 11:50
2:  1 2020-07-23 12:00:00 BST a 2020-07-23 11:51:00 11:51
3:  1 2020-07-23 12:00:00 BST a 2020-07-23 11:52:00 11:52
4:  1 2020-07-23 12:00:00 BST a 2020-07-23 11:53:00 11:53
5:  1 2020-07-23 12:00:00 BST a 2020-07-23 11:54:00 11:54
6:  1 2020-07-23 12:00:00 BST a 2020-07-23 11:55:00 11:55
> tail(res)
   id                    time h                mins mins2
1:  3 2020-07-23 12:20:00 BST i 2020-07-23 12:15:00 12:15
2:  3 2020-07-23 12:20:00 BST i 2020-07-23 12:16:00 12:16
3:  3 2020-07-23 12:20:00 BST i 2020-07-23 12:17:00 12:17
4:  3 2020-07-23 12:20:00 BST i 2020-07-23 12:18:00 12:18
5:  3 2020-07-23 12:20:00 BST i 2020-07-23 12:19:00 12:19
6:  3 2020-07-23 12:20:00 BST i 2020-07-23 12:20:00 12:20
> str(res)
Classes ‘data.table’ and 'data.frame':  189 obs. of  5 variables:
 $ id   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ time : chr  "2020-07-23 12:00:00 BST" "2020-07-23 12:00:00 BST" "2020-07-23 12:00:00 BST" "2020-07-23 12:00:00 BST" ...
 $ h    : chr  "a" "a" "a" "a" ...
 $ mins : POSIXct, format: "2020-07-23 11:50:00" "2020-07-23 11:51:00" "2020-07-23 11:52:00" "2020-07-23 11:53:00" ...
 $ mins2: chr  "11:50" "11:51" "11:52" "11:53" ...
 - attr(*, ".internal.selfref")=<externalptr>

Old wrong answer

Starting from your objects dt and start_time and taking into account https://stackoverflow.com/a/14165493/997979 I would try (of course, the way in which I use substr could be sharper):

> seqmin <- seq(as.POSIXct(strptime(paste0(substr(start_time,1,10)," 07:00:00"), "%Y-%m-%d %H:%M:%S")), start_time, by = "mins")
> seqmin <- substr(seqmin, 12,16)
> seqmin <- data.table(seqmin)
> res <- setkey(dt[,c(k=1,.SD)],k)[seqmin[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL]
> head(res)
   id                    time h seqmin
1:  1 2020-07-23 12:00:00 BST a  07:00
2:  1 2020-07-23 12:00:00 BST b  07:00
3:  1 2020-07-23 12:00:00 BST c  07:00
4:  2 2020-07-23 12:10:00 BST d  07:00
5:  2 2020-07-23 12:10:00 BST e  07:00
6:  2 2020-07-23 12:10:00 BST f  07:00
> tail(res)
   id                    time h seqmin
1:  2 2020-07-23 12:10:00 BST d  11:50
2:  2 2020-07-23 12:10:00 BST e  11:50
3:  2 2020-07-23 12:10:00 BST f  11:50
4:  3 2020-07-23 12:20:00 BST g  11:50
5:  3 2020-07-23 12:20:00 BST h  11:50
6:  3 2020-07-23 12:20:00 BST i  11:50
> str(res)
Classes ‘data.table’ and 'data.frame':  2619 obs. of  4 variables:
 $ id    : num  1 1 1 2 2 2 3 3 3 1 ...
 $ time  : chr  "2020-07-23 12:00:00 BST" "2020-07-23 12:00:00 BST" "2020-07-23 12:00:00 BST" "2020-07-23 12:10:00 BST" ...
 $ h     : chr  "a" "b" "c" "d" ...
 $ seqmin: chr  "07:00" "07:00" "07:00" "07:00" ...
 - attr(*, ".internal.selfref")=<externalptr>

I don't think this quite works. `start_time` is the time the process actually starts running, the column `time` in the table is the time the event starts. I need the time sequencing from the `start_time` to the `time` in the table. I made it `11:50` to keep the size down for this example. — MidnightDataGeek, Jul 23 '20 at 12:09
Then, the `starting_time` you mention is not the `start_time` but the `time` column. And do you need to "duplicate" the table? So, do you want the 9 rows of your example `dt` with a column with minute `12:15`. Or just the 3 last rows, since the first six have `time` previous to `12:15`? Can you clarify your question? — iago, Jul 23 '20 at 13:21
if you see my example, id 1 has 3 rows and the `time` is 10 minutes after the `start_time`. So that group is repeated 10x and the time within each group goes up in 1 minute intervals frme the `start_time` up to the `time`. id 2 starts 10 minutes later and so each group will be repeated 20x, again each group goes up from the `start_time` to the `time` in 1 minute intervals and so on. — MidnightDataGeek, Jul 23 '20 at 14:01
@MidnightDataGeek I deleted my previous comment due to I localize it was produced by an error. — iago, Jul 24 '20 at 08:29
thanks for the solution, the only drawback to this is my data set has 50+ columns and I need all of them duplicating but other than that it worked well. — MidnightDataGeek, Jul 24 '20 at 12:05

Duplicate rows in a data table based on a function in R

1 Answers1