SQL Selecting rows at varying intervals

Question

I've got a situation where I have a huge table, containing a huge number of rows, which looks like (for example):

id          Timestamp               Value
14574499    2011-09-28 08:33:32.020 99713.3000
14574521    2011-09-28 08:33:42.203 99713.3000
14574540    2011-09-28 08:33:47.017 99713.3000
14574559    2011-09-28 08:38:53.177 99720.3100
14574578    2011-09-28 08:38:58.713 99720.3100
14574597    2011-09-28 08:39:03.590 99720.3100
14574616    2011-09-28 08:39:08.950 99720.3100
14574635    2011-09-28 08:39:13.793 99720.3100
14574654    2011-09-28 08:39:19.063 99720.3100
14574673    2011-09-28 08:39:23.780 99720.3100
14574692    2011-09-28 08:39:29.167 99758.6400
14574711    2011-09-28 08:39:33.967 99758.6400
14574730    2011-09-28 08:39:40.803 99758.6400
14574749    2011-09-28 08:39:49.297 99758.6400

Ok, so the rules are: The timestamps can be any n number of seconds apart, 5s, 30s, 60s etc, it varies depending on how old the record is (archiving takes place).

I want to be able to query this table to select each nth row based on the timestamp.

So for example:

Select * from mytable where intervalBetweenTheRows = 30s

(for the purposes of this question, based on the presumption the interval requested is always to a higher precision than available in the database)

So, every nth row based on the time between each row

Any ideas?!

Karl

For those of you who are interested, recursive CTE was actually quite slow, I thought of a slightly different method:

SELECT TOP 500
    MIN(pvh.[TimeStamp]) as [TimeStamp],
    AVG(pvh.[Value]) as [Value]
FROM
    PortfolioValueHistory pvh
WHERE
    pvh.PortfolioID = @PortfolioID
    AND pvh.[TimeStamp] >= @StartDate
    AND pvh.[TimeStamp] <= @EndDate
GROUP BY
    FLOOR(DateDiff(Second, '01/01/2011 00:00:00', pvh.[TimeStamp]) / @ResolutionInSeconds)
ORDER BY 
    [TimeStamp] ASC

I take the timestamp minus an arbitrary date to give a base int to work with, then floor and divide this by my desired resolution, I then group by this, taking the min timestamp (the first of that 'region' of stamps) and the average value for that 'period'.

This is used to plot a graph of historical data, so the average value does me fine.

This was the fastest execution based on the table size that I could come up with

Thanks for your help all.

Please explain further. You want the 30s to be a **minimum** interval between returned rows? Or an exact interval? — JNK, Oct 10 '11 at 13:36
When you say "every nth row" you make it sound like you want rows 2, 4, 6, 8, etc. But really you just want to discard all records that happen within N seconds of the last returned record? Or perhaps, the first record after 30s, and the first records after 60s, and the first record after 90s, etc? (Perhaps you could give an example result set that you're looking for?) Also, can I make these assumptions? No two rows have the same Timestamp? And/Or that a records with a higher `id` CAN'T happen BEFORE a record with a lower `id`? — MatBailie, Oct 10 '11 at 13:38
The "NO CURSORS" comment is a given. We don't do no cursin cursors no more. — Raj More, Oct 10 '11 at 13:39

Martin Smith · Accepted Answer · 2011-10-10T15:36:49.320

3

Assuming that the requirement is that the determinant for whether a row is returned or not depends on the time elapsed from the previous returned row this needs a procedural approach. Recursive CTEs might be a bit more efficient than a cursor though.

WITH RecursiveCTE
     AS (SELECT TOP 1 *
         FROM @T
         ORDER BY [Timestamp]
         UNION ALL
         SELECT id,
                [Timestamp],
                Value
         FROM   (
                --Can't use TOP directly
                SELECT T.*,
                       rn = ROW_NUMBER() OVER (ORDER BY T.[Timestamp])
                 FROM   @T T
                        JOIN RecursiveCTE R
                          ON T.[Timestamp] >=
                                 DATEADD(SECOND, 30, R.[Timestamp])) R
         WHERE  R.rn = 1)
SELECT *
FROM RecursiveCTE

edited Oct 10 '11 at 15:36

answered Oct 10 '11 at 13:49

Martin Smith

438,706
87
741
845

Note : SQL Server has a maximum recursion depth, unless specified otherwise. Question: If you set MAXRECURSION to 0, how do recursive CTE's perform over large numbers of itteration? – MatBailie Oct 10 '11 at 13:53
@Dems - Depends on indexes but generally not great but better than cursors or triangular joins. [See my running totals answer here for some figures](http://stackoverflow.com/questions/7357516/subquery-or-leftjoin-with-group-by-which-one-is-faster) – Martin Smith Oct 10 '11 at 13:56
Could this be sensibly modified such that the intial records are ones that do not have any preceding records within 30seconds? Effectively itterating though multiple different 'blocks' of data concurrently? And would there be any benefit in doing so? (Due to the reduced recursion depth?) – MatBailie Oct 10 '11 at 13:57
@Dems If `[Timestamp]` is indexed then that wouldn't reduce the number of seeks – Martin Smith Oct 10 '11 at 14:03
MArtin, please read my comment and code further down the page – Stono Oct 10 '11 at 15:28
@user987768 - Delete the `PARTITION BY` it is not needed here, I re-used a bit of code from previously and forgot to remove it! – Martin Smith Oct 10 '11 at 15:36

score 1 · Answer 2 · answered Oct 10 '11 at 14:34

This isn't as elegant as Martin S's CTE, but instead uses interpolation on predefined sample points to get the first sample in between each pair of sampling times. If there is no sample in a period then no record is returned.

DECLARE @SampleTime DATETIME
DECLARE @NumberSamples INT
DECLARE @SampleInterval INT

SET @SampleTime = '2011-09-28 08:33:32.020' -- Start time
SET @NumberSamples = 20 -- Or however many sample intervals you need to evaluate
SET @SampleInterval = 30 -- Seconds

CREATE TABLE #tmpTimesToSample
(
    SampleID INT,
    SampleTime DATETIME NULL
)

-- Works out the time intervals, 0 to 19
INSERT INTO #tmpTimesToSample(SampleID, SampleTime)
SELECT TOP (@NumberSamples)
    sv.number,
    DATEADD(ss, sv.number * @SampleInterval, @SampleTime)
FROM
    master..spt_values sv
WHERE 
    type = 'p'
ORDER BY
    sv.number ASC

-- Now interpolate these sample intervals back into the data table
SELECT ID, [TimeStamp], Value
FROM
(
    SELECT mt.Id, mt.[TimeStamp], mt.Value, row_number() over (partition by tmp.SampleID order by tmp.SampleID) as RowNum
    FROM #tmpTimesToSample tmp RIGHT OUTER JOIN MyTable mt
    on mt.[TimeStamp] BETWEEN tmp.SampleTime and DATEADD(ss, @SampleInterval, tmp.SampleTime)
) x
WHERE x.RowNum = 1 -- Only want the first sample in each bin

DROP TABLE #tmpTimesToSample

Test data:

CREATE TABLE MyTable
(
    ID BIGINT,
    [TimeStamp] DATETIME,
    [Value] DECIMAL(18,4)
)
GO

insert into MyTable values(14574499, '2011-09-28 08:33:32.020', 99713.3000)
insert into MyTable values(14574521    ,'2011-09-28 08:33:42.203',  99713.3000)
insert into MyTable values(14574540    ,'2011-09-28 08:33:47.017', 99713.3000)
insert into MyTable values(14574559    ,'2011-09-28 08:38:53.177', 99720.3100)
insert into MyTable values(14574578    ,'2011-09-28 08:38:58.713', 99720.3100)
insert into MyTable values(14574597    ,'2011-09-28 08:39:03.590', 99720.3100)
insert into MyTable values(14574616    ,'2011-09-28 08:39:08.950', 99720.3100)
insert into MyTable values(14574635    ,'2011-09-28 08:39:13.793', 99720.3100)
insert into MyTable values(14574654    ,'2011-09-28 08:39:19.063', 99720.3100)
insert into MyTable values(14574673    ,'2011-09-28 08:39:23.780', 99720.3100)
insert into MyTable values(14574692    ,'2011-09-28 08:39:29.167', 99758.6400)
insert into MyTable values(14574711    ,'2011-09-28 08:39:33.967', 99758.6400)
insert into MyTable values(14574730    ,'2011-09-28 08:39:40.803', 99758.6400)
insert into MyTable values(14574749    ,'2011-09-28 08:39:49.297', 99758.6400)
go

score 0 · Answer 3 · answered Oct 10 '11 at 13:35

0

This will give you all rows that have a 30 millisecond interval to the next row. Both rows will be side by side.

Select T1.*, T2.*
From MyTable T1
    Inner Join MyTable T2
        On DateDiff (millisecond, T1.Value, T2.Value) = 30

answered Oct 10 '11 at 13:35

Raj More

47,048
33
131
198

30 milliseconds? I don't think this will return any rows at all as written. – JNK Oct 10 '11 at 13:37
1

Won't that return every row where its datediff is 30 millisec from another row? So row #2 would be returned if it's datediff is 30 millisec from any other row, not just if it was 30 millisec after row #1. – Dave D Oct 10 '11 at 13:37
1

Also, isn't this extremely index unfriendly? Would it not be better to re-arrange the predicate? `ON T1.Value = DATEADD(SECOND, 30, T2.Value)` – MatBailie Oct 10 '11 at 13:46

SQL Selecting rows at varying intervals

3 Answers3

Linked