How to find max or min of a column in a data.table up to a given time in a day

Question

I've got some financial time series in one-minute intervals ("bars"). Because I'll be dealing with some large series, I'm learning to use data.table, and I'm not quite used to thinking in the data.table way yet.
In the data I've imported into a data.table (let's call it DT here for simplicity), among other things not relevant to this question, I've got a column called Date in YYYYMMDD format, a column called Time in HH:MM format, and a column called Price (let's say it's the final price for that minute) that's an integer. I'm simplifying a little, but if anyone can help me with the question I'm asking here, I can adapt it to the details of my data. Here's an example:

Date     Time  Price
20151028 09:00 47675
20151028 09:01 47650
20151028 09:02 47670
20151028 09:03 47685
20151028 09:04 47690
...

What I'd like to do is calculate, for each minute, the maximum and minimum values of Price in the day up to that minute. Let's go ahead and add the things I want to calculate as additional columns, which I'll call DayMax and DayMin here. The result would then look like this:

Date     Time  Price DayMax DayMin
20151028 09:00 47675 47675  47675
20151028 09:01 47650 47675  47650
20151028 09:02 47670 47675  47650
20151028 09:03 47685 47685  47650
20151028 09:04 47690 47690  47650
...

I'm pretty sure my by (which I understand as being like a GROUP BY in SQL) should be Date, and I think what I'm having trouble understanding is how to set up my i (like a WHERE in SQL), basically because I'm not sure how to use the same column two different ways in the syntax. For each row, I want to calculate the maximum where (and by "where," I think I mean WHERE) Time is less than or equal to the value of Time in that specific row, and where I'm GROUP(ing)BY Date, so it's up to that line's time in the specific day. How can I do this in a quick and memory-efficient way using data.table?
By the way, speaking of "quick and memory-efficient," I'm assuming I should use the := operator to create the new columns. Please correct me if I'm wrong about that.

Try `cummin` and `cummax`. If you want more specific help, you'll probably want to follow the site's guidance: https://stackoverflow.com/help/how-to-ask and http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example/28481250#28481250 — Frank, Mar 19 '17 at 05:39
@Frank, you are right, of course. I will make sure to provide R code to produce example data in future questions. Should I edit the question to add such code? — Phys-math-stat, Mar 19 '17 at 13:02
Nope, that's okay; looks like it was good enough for Uwe to answer already. — Frank, Mar 19 '17 at 14:06

score 2 · Accepted Answer · edited May 23 '17 at 12:17

2

As suggested by Frank, cummin() and cummax() will do that for you. (In SQL this is called window function or analytic function in ORACLE, I believe).

DT[order(Date, Time), `:=`(DayMax = cummax(Price), DayMin = cummin(Price)), by = Date]
DT
#       Date  Time Price DayMax DayMin
#1: 20151028 09:00 47675  47675  47675
#2: 20151028 09:01 47650  47675  47650
#3: 20151028 09:02 47670  47675  47650
#4: 20151028 09:03 47685  47685  47650
#5: 20151028 09:04 47690  47690  47650

Just to make sure the data are in the proper order, DT is ordered by Dateand Time. This can be skipped if this is ensured by previous steps. The two new columns are computed within one expression, separately for each Date as requested.

Using chaining, this could be rewritten as

res <- DT[order(Date, Time)][, DayMax := cummax(Price), by = Date][, DayMin := cummin(Price), by = Date][]

but this has the disadvantage that a copy of DT is created instead of DT being modified by reference plus requiring an additional grouping operation (Thanks to Frank for pointing this out).

Data

library(data.table)
DT <- fread ("Date     Time  Price
         20151028 09:00 47675
         20151028 09:01 47650
         20151028 09:02 47670
         20151028 09:03 47685
         20151028 09:04 47690")

edited May 23 '17 at 12:17

Community

1
1

answered Mar 19 '17 at 07:51

Uwe

41,420
11
90
134

Uwe Block, I had written a comment this morning thanking you for this, but I guess I forgot to click on "Add Comment." Thanks so much for this answer! As @Frank correctly pointed out in a comment on the question, I had failed to provide R code to produce the example data, but even so, you wrote an answer that was thorough and very useful to me. You even papered over my mistake by providing simple code to reproduce the example data! – Phys-math-stat Mar 20 '17 at 01:09
@Phys-math-stat No worries, I appreciate your carefully crafted Q which demonstrate your efforts to understand `data.table` syntax. – Uwe Mar 20 '17 at 06:59
Each programming language ends up requiring me to think in a different way. I have my own way of thinking about how to do things in R, but now that I've found `data.table`, everything is changing. I'm starting to get some of the syntax and the way of thinking, but it's not natural yet. I am sure it's worth learning, and I'm enjoying the learning process. Your help is greatly appreciated! – Phys-math-stat Mar 21 '17 at 13:06
@Phys-math-stat Perhaps you already have discovered https://github.com/Rdatatable/data.table/wiki/Getting-started. I found the vignettes including the FAQ very helpful in understanding the potential of `data.table`. Section 2.16 of the FAQ explaining how `data.table` syntax relates to `SQL helped me a lot. – Uwe Mar 21 '17 at 13:33
I had been seeing those things on rawgit, which is where the links from the github page point. The date on rawgit appears to be off by three months. For anyone else reading this who's trying to learn `data.table`, I recommend the vignettes linked by Uwe Block. They're really useful. Thanks again, UB! – Phys-math-stat Mar 21 '17 at 16:48

How to find max or min of a column in a data.table up to a given time in a day

1 Answers1

Data