
I have a requirement where I need to know when sum(value) reaches a certain point and to calculate the duration until that happens. Below is a sample table.

create table sample (dt timestamp, value real);

insert into sample values
     ('2019-01-20 00:29:43 ',0.29)
    ,('2019-01-20 00:35:06 ',0.31)
    ,('2019-01-20 00:35:50 ',0.41)
    ,('2019-01-20 00:36:32 ',0.26)
    ,('2019-01-20 00:37:20 ',0.33)
    ,('2019-01-20 00:41:30 ',0.42)
    ,('2019-01-20 00:42:28 ',0.35)
    ,('2019-01-20 00:43:14 ',0.52)
    ,('2019-01-20 00:44:18 ',0.25);

My requirement is to calculate, for each row, the cumulative sum of the following rows and find when that sum(value) rises above 1.0. That can take just 1 row or n rows. Once that row is found, I need to calculate the time difference between the current row and the row where sum(value) rises above 1.0.

Essentially, my desired output is in the format below.
For the 1st row, the cumulative sum(value) threshold is reached at the 3rd row.
For the 2nd row, it is reached at the 4th row, etc.

         dt          | value | sum(value) | time_at_sum(value)_1 | Duration
---------------------+-------+------------+----------------------+----------
 2019-01-20 00:29:43 |  0.29 |       1.01 | 2019-01-20 00:35:50  | 00:06:07
 2019-01-20 00:35:06 |  0.31 |       1.31 | 2019-01-20 00:37:20  | 00:02:14
 2019-01-20 00:35:50 |  0.41 |       1.00 | 2019-01-20 00:37:20  | 00:01:30
 2019-01-20 00:36:32 |  0.26 |       1.01 | 2019-01-20 00:41:30  | 00:04:58
 2019-01-20 00:37:20 |  0.33 |       1.10 | 2019-01-20 00:42:28  | 00:05:08
 2019-01-20 00:41:30 |  0.42 |       1.29 | 2019-01-20 00:43:14  | 00:01:44
 2019-01-20 00:42:28 |  0.35 |       1.12 | 2019-01-20 00:44:18  | 00:01:50
 2019-01-20 00:43:14 |  0.52 |       NULL | -                    | -
 2019-01-20 00:44:18 |  0.25 |       NULL | -                    | -
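For example, the first row's output is derived like this: starting at 2019-01-20 00:29:43, the running sum of value is 0.29 + 0.31 + 0.41 = 1.01, which is the first point where it reaches 1.0. That happens at the 3rd row (2019-01-20 00:35:50), so time_at_sum(value)_1 = 2019-01-20 00:35:50 and Duration = 00:35:50 - 00:29:43 = 00:06:07.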

Does anyone have ideas or pointers on how to deal with the above requirement?

  • And what does `sum(value)` have to do with the data? – Gordon Linoff Feb 28 '19 at 01:26
  • It's the cumulative sum of the value column until it reaches 1.0. Just showing the expected output. – acul Feb 28 '19 at 01:28
  • The data you have shown for `value` does not obviously sum to the `sum(value)` examples. Can you edit the description to show the actual computation that leads from `value` to `sum(value)` in your example? – bignose Feb 28 '19 at 01:45
  • Hi @bignose, I have edited my request. I manually populated the desired output column. Basically, sum(value) is the cumulative sum of the next n records until it reaches 1. – acul Feb 28 '19 at 02:20
  • `1.00` for `sum(value)` in your result violates your condition `when the sum(value) reaches above 1.0`, which translates to `> 1.0`, not `>= 1.0`. – Erwin Brandstetter Feb 28 '19 at 03:32

2 Answers

WITH tmp AS (
    SELECT *
        , sum(value) OVER (ORDER BY dt rows between current row and unbounded following) as forward_sum  -- sum of value from this row to the last row
    FROM sample
    ORDER BY dt)
SELECT t1.dt, t1.value
    , (t2.value + t1.forward_sum - t2.forward_sum) as "sum(value)"  -- sum of value from row t1 through row t2, inclusive
    , t2.dt as "time_at_sum(value)_1"
    , t2.dt - t1.dt as "Duration"                                   -- timestamp difference yields an interval
FROM tmp t1
LEFT JOIN LATERAL (
    SELECT *
    FROM tmp t
    WHERE t1.forward_sum - t.forward_sum < 1                        -- sum from t1 up to (not including) t is still below 1
        AND (t.value + t1.forward_sum - t.forward_sum) >= 0.999     -- including t, the sum reaches ~1 (tolerance for real rounding)
    ORDER BY dt DESC                                                -- pick the single qualifying (crossing) row
    LIMIT 1
    ) t2
ON TRUE

yields

| dt                  | value | sum(value) | time_at_sum(value)_1 | Duration |
|---------------------+-------+------------+----------------------+----------|
| 2019-01-20 00:29:43 |  0.29 |       1.01 | 2019-01-20 00:35:50  | 00:06:07 |
| 2019-01-20 00:35:06 |  0.31 |       1.31 | 2019-01-20 00:37:20  | 00:02:14 |
| 2019-01-20 00:35:50 |  0.41 |          1 | 2019-01-20 00:37:20  | 00:01:30 |
| 2019-01-20 00:36:32 |  0.26 |       1.01 | 2019-01-20 00:41:30  | 00:04:58 |
| 2019-01-20 00:37:20 |  0.33 |        1.1 | 2019-01-20 00:42:28  | 00:05:08 |
| 2019-01-20 00:41:30 |  0.42 |       1.29 | 2019-01-20 00:43:14  | 00:01:44 |
| 2019-01-20 00:42:28 |  0.35 |       1.12 | 2019-01-20 00:44:18  | 00:01:50 |
| 2019-01-20 00:43:14 |  0.52 |            |                      |          |
| 2019-01-20 00:44:18 |  0.25 |            |                      |          |

First compute, for each row, the cumulative sum of value from that row through the end of the table (a backward running sum):

SELECT *
    , sum(value) OVER (ORDER BY dt rows between current row and unbounded following) as forward_sum
FROM sample
ORDER BY dt

which yields

| dt                  | value | forward_sum |
|---------------------+-------+-------------|
| 2019-01-20 00:29:43 |  0.29 |        3.14 |
| 2019-01-20 00:35:06 |  0.31 |        2.85 |
| 2019-01-20 00:35:50 |  0.41 |        2.54 |
| 2019-01-20 00:36:32 |  0.26 |        2.13 |
| 2019-01-20 00:37:20 |  0.33 |        1.87 |
| 2019-01-20 00:41:30 |  0.42 |        1.54 |
| 2019-01-20 00:42:28 |  0.35 |        1.12 |
| 2019-01-20 00:43:14 |  0.52 |        0.77 |
| 2019-01-20 00:44:18 |  0.25 |        0.25 |

Notice that the difference of two forward_sum values corresponds to a partial sum over value. For example,

0.29 + 0.31 + 0.41 = 3.14 - 2.13
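A quick way to double-check this identity, using the numbers copied from the tables above, is:

SELECT 0.29 + 0.31 + 0.41 AS direct_sum
     , 3.14 - 2.13        AS forward_sum_diff;  -- both return 1.01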

So differences of forward_sum values are going to play an important role, and we'll want to compare these differences to 1. We're going to want to join this table with itself, using a join condition like:

t1.forward_sum - t.forward_sum < 1

Let's see what happens if we use LEFT JOIN LATERAL. The critical thing to know about LEFT JOIN LATERAL is that the subquery to the right of a LATERAL join has to be evaluated once for each row in the table on the left:

WITH tmp AS (
    SELECT *
        , sum(value) OVER (ORDER BY dt rows between current row and unbounded following) as forward_sum
    FROM sample
    ORDER BY dt)
SELECT t1.*, t2.*
FROM tmp t1
LEFT JOIN LATERAL (
    SELECT * 
    FROM tmp t
    WHERE t1.forward_sum - t.forward_sum < 1
    ORDER BY dt DESC 
    LIMIT 1
    ) t2
ON TRUE

yields

| dt                  | value | forward_sum | dt                  | value | forward_sum |
|---------------------+-------+-------------+---------------------+-------+-------------|
| 2019-01-20 00:29:43 |  0.29 |        3.14 | 2019-01-20 00:35:50 |  0.41 |        2.54 |
| 2019-01-20 00:35:06 |  0.31 |        2.85 | 2019-01-20 00:37:20 |  0.33 |        1.87 |
| 2019-01-20 00:35:50 |  0.41 |        2.54 | 2019-01-20 00:37:20 |  0.33 |        1.87 |
| 2019-01-20 00:36:32 |  0.26 |        2.13 | 2019-01-20 00:41:30 |  0.42 |        1.54 |
| 2019-01-20 00:37:20 |  0.33 |        1.87 | 2019-01-20 00:42:28 |  0.35 |        1.12 |
| 2019-01-20 00:41:30 |  0.42 |        1.54 | 2019-01-20 00:43:14 |  0.52 |        0.77 |
| 2019-01-20 00:42:28 |  0.35 |        1.12 | 2019-01-20 00:44:18 |  0.25 |        0.25 |
| 2019-01-20 00:43:14 |  0.52 |        0.77 | 2019-01-20 00:44:18 |  0.25 |        0.25 |
| 2019-01-20 00:44:18 |  0.25 |        0.25 | 2019-01-20 00:44:18 |  0.25 |        0.25 |

Notice that we've guessed our way to a join condition which matches the desired dates. Now it's just a matter of composing the right value expressions to obtain the desired columns sum(value), time_at_sum(value)_1 and Duration.
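One detail worth spelling out: the subquery compares against 0.999 rather than 1, presumably to absorb single-precision rounding error, since value is declared as real. If exact arithmetic is preferred, a variant of the same query (a sketch, not part of the original answer) can cast to numeric and compare against 1 directly:

WITH tmp AS (
    SELECT *
        , sum(value::numeric) OVER (ORDER BY dt rows between current row and unbounded following) as forward_sum
    FROM sample
    ORDER BY dt)
SELECT t1.dt, t1.value
    , (t2.value::numeric + t1.forward_sum - t2.forward_sum) as "sum(value)"
    , t2.dt as "time_at_sum(value)_1"
    , t2.dt - t1.dt as "Duration"
FROM tmp t1
LEFT JOIN LATERAL (
    SELECT *
    FROM tmp t
    WHERE t1.forward_sum - t.forward_sum < 1
        AND (t.value::numeric + t1.forward_sum - t.forward_sum) >= 1  -- exact comparison, no 0.999 tolerance needed
    ORDER BY dt DESC
    LIMIT 1
    ) t2
ON TRUE

This keeps the join logic identical; only the arithmetic type changes.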

unutbu
  • Added your smart query to my fiddle - for testing convenience: https://dbfiddle.uk/?rdbms=postgres_11&fiddle=e774e0003218d5142ba5db8def8be8a0 – Erwin Brandstetter Feb 28 '19 at 04:08
  • This is great. Thanks a lot for your help, unutbu. I was cracking my head for a couple of days over this problem. – acul Feb 28 '19 at 04:49

A way to solve this efficiently is a procedural solution with two cursors: one explicit cursor and the implicit cursor of the FOR loop:

CREATE OR REPLACE FUNCTION foo()
  RETURNS TABLE (dt timestamp
               , val real
               , sum_value real
               , time_at_sum timestamp
               , duration interval) AS
$func$
DECLARE
   _bound real := 1.0;          -- your bound here
   cur CURSOR FOR SELECT * FROM sample s ORDER BY s.dt; -- in chronological order
   s sample;                    -- cursor row 
BEGIN
   OPEN cur;
   FETCH cur INTO time_at_sum, sum_value; -- read 1st row: its dt into time_at_sum, its value into sum_value

   FOR dt, val IN  -- primary pass over table
      SELECT x.dt, x.value FROM sample x ORDER BY x.dt
   LOOP
      WHILE sum_value <= _bound LOOP
         FETCH cur INTO s;
         IF NOT FOUND THEN  -- end of table
            sum_value := NULL; time_at_sum := NULL;
            EXIT;           -- exits inner loop
         END IF;
         sum_value := sum_value + s.value; 
      END LOOP;
      IF sum_value > _bound THEN  -- to catch end-of-table
         time_at_sum := s.dt;
      END IF;   
      duration := time_at_sum - dt;
      RETURN NEXT;
      sum_value := sum_value - val;  -- subtract previous row before moving on
   END LOOP;
END
$func$  LANGUAGE plpgsql;

Call:

SELECT * FROM foo();

db<>fiddle here

Should perform nicely since it only needs 2 scans over the table.

Note that I implemented > _bound as your description requires, not >= _bound as your result indicates. Easy to change either way.

This assumes the value column is NOT NULL.
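If value can contain NULLs, one simple option (a sketch, assuming such rows should simply be skipped; the view name is just for illustration) is to point both queries inside the function at a filtered relation:

CREATE OR REPLACE VIEW sample_nonnull AS
SELECT dt, value
FROM   sample
WHERE  value IS NOT NULL;  -- rows with NULL value would otherwise make the running sum NULL

Then the cursor query and the FOR-loop query in foo() select FROM sample_nonnull instead of FROM sample.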


Erwin Brandstetter
  • @acul: I would be very interested in actual times for the function as compared to unutbu's query (along with the number of rows in your table and a rough avg of rows to be aggregated per `sum_value`). – Erwin Brandstetter Mar 01 '19 at 16:38
  • Hi @Erwin Brandstetter: I get an error saying "ERROR: value out of range: overflow" with both solutions. – acul Mar 04 '19 at 06:34
  • Is your column `value` in fact defined as `real`? What is the true value of `_bound` in your error case (1.0 in the example) and what are min and max of value? You must have *huge* values for `value`, the range for `real` is typically from at least 1E-37 to 1E+37. (Or you introduced an independent problem somehow.) Maybe start a new question with all relevant details. – Erwin Brandstetter Mar 04 '19 at 12:43
  • 1
    Hi @Erwin Brandstetter:, yes, you are right. My bad, didnt look the value range when copied from parent table. I fixed that problem. When I run on table with ~ 55K records, your solution runs ~ 4 secs where as SQL is running for ~ 6 mins. Thats a big difference and time difference drastically increasing as the table size grows. Thanks Erwin. Its clear on which way to go – acul Mar 05 '19 at 04:01