
I'm interested in finding out how to sum a column based on the seconds value of a time column in a data.table.

For example, let's say I have a data table as follows:

Time                   |      Inventory
----------------------------------------------     
08-01-2001 11:50:12    |       5
08-01-2001 11:50:16    |       8
08-01-2001 11:50:17    |       2
08-01-2001 11:50:17    |       1
08-01-2001 11:50:19    |       10
08-01-2001 11:50:23    |       5
08-01-2001 11:50:23    |       9
08-01-2001 11:51:23    |       12

Then I would like to be able to sum the Inventory based on the seconds value in the Time column, such that I get a data.table as follows:

Time                   |      Inventory
----------------------------------------------     
08-01-2001 11:50:12    |       5
08-01-2001 11:50:16    |       8
08-01-2001 11:50:17    |       3
08-01-2001 11:50:17    |       3
08-01-2001 11:50:19    |       10
08-01-2001 11:50:23    |       14
08-01-2001 11:50:23    |       14
08-01-2001 11:51:23    |       12

I've tried using variations of the `aggregate()` function, but these always seem to collapse the repeated rows into one, which I do not want. Is there a way I could perform this action using data.table or perhaps xts? Thanks in advance.

EDIT: Here is the dput output:

structure(list(Timecol = c("0008-01-20 00:00:00", "0008-01-20 00:00:00", "0008-01-20 00:00:00", "0008-01-20 00:00:00", "0008-01-20 00:00:00", "0008-01-20 00:00:00", "0008-01-20 00:00:00", "0008-01-20 00:00:00" ), ID = c("11", "11", "11", "11", "11", "11", "11", "11"), Inventorycol = c("5", "8", "2", "1", "10", "5", "9", "12")), .Names = c("Timecol", "ID", "Inventorycol"), row.names = c(NA, -8L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x00000000028b0788>)

  • You want to sum based on *just* the seconds? Or on the entire timestamp? I.e. would 11:50:00 and 11:51:00 sum together or not? – dww Dec 10 '17 at 04:59
  • @dww I want to sum only on the seconds but if they are different minutes/hours/days etc. then I don't want them to sum together, so 11:50:00 and 11:51:00 won't sum together since they are on different minutes. I'll update my original post to include that detail :) – reallybadstatdude Dec 10 '17 at 05:02
  • While you're updating please add output of `dput` on your data.table. – dww Dec 10 '17 at 05:03
  • @dww Sure, I added the dput output now – reallybadstatdude Dec 10 '17 at 05:09
  • Just do `dat[, Sum := sum(Inventory), by = Time]` if you want a data.table version. – David Arenburg Dec 10 '17 at 07:36

3 Answers


You can use the ave function:

 dat$Sum <- ave(dat$Inventory, dat$Time, FUN = sum)
 dat
                      Time Inventory Sum
 1 08-01-2001 11:50:12             5   5
 2 08-01-2001 11:50:16             8   8
 3 08-01-2001 11:50:17             2   3
 4 08-01-2001 11:50:17             1   3
 5 08-01-2001 11:50:19            10  10
 6 08-01-2001 11:50:23             5  14
 7 08-01-2001 11:50:23             9  14

The data used:

 dat=read.table(text="
 Time                   |      Inventory
 08-01-2001 11:50:12    |       5
 08-01-2001 11:50:16    |       8
 08-01-2001 11:50:17    |       2
 08-01-2001 11:50:17    |       1
 08-01-2001 11:50:19    |       10
 08-01-2001 11:50:23    |       5
 08-01-2001 11:50:23    |       9",sep="|",h=T,stringsAsFactors=F)
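
Note that in the `dput` data from the question, `Inventorycol` is stored as character, so it would need converting to numeric before summing. A minimal sketch, assuming that object is called `dt`:

 dt$Inventorycol <- as.numeric(dt$Inventorycol)          # dput stores the column as character
 dt$Sum <- ave(dt$Inventorycol, dt$Timecol, FUN = sum)   # per-timestamp sums, all rows kept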
Onyambu
  • Thanks! It's a bit of an odd situation since both of you gave me solutions at the exact same time; I'm accepting your answer as the one that solved it since yours works a fair bit faster (microseconds compared to milliseconds on my laptop, using the microbenchmark package). – reallybadstatdude Dec 10 '17 at 05:25
  • Wow, I did not know that. Thanks. Well, this is a base R version; I would have thought `data.table` would run much faster than this! So I guess sometimes base R works faster than other packages. – Onyambu Dec 10 '17 at 06:09
  • Yeah, that's what I thought as well - I almost didn't try your version since it uses base R and I automatically figured that data.table would be faster, but microbenchmark says your version is quite a bit quicker, which I found quite unusual (but I'm definitely not complaining!) – reallybadstatdude Dec 10 '17 at 06:22
  • @reallybadstatdude Can you share your benchmark? I have great doubts `ave` will beat data.table. – David Arenburg Dec 10 '17 at 07:36
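
The comparison discussed above could be reproduced roughly as follows (a minimal sketch, assuming a data.table copy `DT` of the same data; actual timings will vary by machine):

 library(data.table)
 library(microbenchmark)

 DT <- as.data.table(dat)   # data.table copy for the := version

 microbenchmark(
   ave        = ave(dat$Inventory, dat$Time, FUN = sum),
   data.table = DT[, Sum := sum(Inventory), by = Time],
   times = 100L
 )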

You can use

dat[, Sum := sum(Inventory), by = Time]
#                   Time Inventory Sum
# 1: 08-01-2001 11:50:12         5   5
# 2: 08-01-2001 11:50:16         8   8
# 3: 08-01-2001 11:50:17         2   3
# 4: 08-01-2001 11:50:17         1   3
# 5: 08-01-2001 11:50:19        10  10
# 6: 08-01-2001 11:50:23         5  14
# 7: 08-01-2001 11:50:23         9  14
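
For contrast with the `aggregate()`-style behaviour the question wants to avoid: grouping with `.()` returns a collapsed table with one row per timestamp, whereas `:=` adds the sum as a column and keeps every row. A small sketch of the difference, assuming `dat` is a data.table:

# collapses to one row per Time (what the question wants to avoid):
dat[, .(Sum = sum(Inventory)), by = Time]

# adds the per-Time sum as a new column by reference, keeping all rows:
dat[, Sum := sum(Inventory), by = Time]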
dww
  • Why not just `dat[, Sum := sum(Inventory), by = Time]`? – David Arenburg Dec 10 '17 at 07:34
  • I never even saw this, @David. Your code works perfectly fine. And if I were to use data.table, I guess this would be the first thing to come to mind, though I am still not comfortable with data.table. – Onyambu Dec 10 '17 at 07:51

Here is how you could do it with xts:

# create xts object:

dat=read.table(text="
 Time                   |      Inventory
 08-01-2001 11:50:12    |       5
 08-01-2001 11:50:16    |       8
 08-01-2001 11:50:17    |       2
 08-01-2001 11:50:17    |       1
 08-01-2001 11:50:19    |       10
 08-01-2001 11:50:23    |       5
 08-01-2001 11:50:23    |       9",sep="|",h=T,stringsAsFactors=F)

dat[, "Time2"] <- as.POSIXct(dat[, "Time"], format = "%d-%m-%Y %H:%M:%S")
x <- xts(dat[, "Inventory"], order.by = dat[, "Time2"])

sum_with_inv <- function(y) {
  # total Inventory for all observations sharing this timestamp
  Sum <- sum(y)
  # attach that total as an extra column next to each original observation
  m <- merge(y, Sum)
  m
}



y <- do.call(rbind, lapply(split(x, f = "seconds"), sum_with_inv))
y
#                      y Sum
# 2001-01-08 11:50:12  5   5
# 2001-01-08 11:50:16  8   8
# 2001-01-08 11:50:17  2   3
# 2001-01-08 11:50:17  1   3
# 2001-01-08 11:50:19 10  10
# 2001-01-08 11:50:23  5  14
# 2001-01-08 11:50:23  9  14
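
If a plain data.frame is wanted back from the xts result, the index and data can be recombined, for example (a small sketch using the zoo/xts accessors `index()` and `coredata()`):

# recombine the timestamp index with the summed values
out <- data.frame(Time = index(y), coredata(y))
out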
FXQuantTrader