
I have data that looks like this:

metric_date      location  id    value
20/02/07 13:00   ATL       A      34
20/02/07 13:05   ATL       B      12
20/02/07 13:10   ATL       B      02
20/02/07 13:15   ATL       A      15
20/02/07 13:20   ATL       A      00         
20/02/07 13:25   ATL       A      00
20/02/07 13:30   ATL       A      12
20/02/07 13:35   ATL       B      12
20/02/07 13:40   ATL       A      23
20/02/07 13:45   ATL       B      03
20/02/07 13:50   ATL       A      00
20/02/07 13:55   ATL       A      00

I need to find MAX(value) of each section between the zero-value rows, and SUM(value) of the rows where id is 'B' within that section, to get SUM()/MAX() = success_rate.

I tried:

SELECT 
      CASE
       WHEN DATE(metric_date) = LAG(DATE(metric_date), 1) OVER (ORDER BY metric_date) 
            AND location = LAG(location, 1) OVER (ORDER BY metric_date)
       THEN 1
      END AS work_period
    , CASE
        WHEN LAG(value, 1) OVER (ORDER BY metric_date) = 0
             AND LEAD(value, 1) OVER (ORDER BY metric_date) > 0
        THEN LAG(work_period, 1) + 1
        WHEN LAG(SUM(value), 1) OVER (ORDER BY metric_date) > 0
        THEN LAG(work_period, 1)
       END AS section

I need the results to look like this:

location  section   max   sum   success_rate
ATL         1       34    14    0.4118
ATL         2       23    15    0.6522

2 Answers


This is a Gaps and Islands problem (the article is for SQL Server, but the approach applies equally to PostgreSQL): the islands are the runs of consecutive non-zero rows, and the zero-value rows are the gaps that separate them.

The following should solve your problem:

SELECT  Location,
        MAX(Value) AS Max,
        SUM(CASE WHEN id = 'B' THEN Value END) AS Sum,
        1.0 * SUM(CASE WHEN id = 'B' THEN Value END) / MAX(Value) AS SuccessRate
FROM    (   SELECT  *,
                    ROW_NUMBER() OVER(PARTITION BY Location, CASE WHEN Value = 0 THEN 1 ELSE 0 END ORDER BY metric_date) - 
                        ROW_NUMBER() OVER(PARTITION BY Location ORDER BY metric_date) AS GroupingSet
            FROM    T
        ) AS t
WHERE   Value <> 0
GROUP BY Location, GroupingSet;
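
One note on the outer query: the 1.0 * factor forces non-integer division. Without it, SUM(...) / MAX(Value) is integer division, so 14 / 34 would truncate to 0. In PostgreSQL an explicit cast achieves the same thing (a sketch of the alternative, not part of the query above):

        SUM(CASE WHEN id = 'B' THEN Value END)::numeric / MAX(Value) AS SuccessRate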

The key is generating a grouping field that identifies the islands, which can be done by allocating two row numbers to each row:

SELECT  *,
        ROW_NUMBER() OVER(PARTITION BY Location, CASE WHEN Value = 0 THEN 1 ELSE 0 END ORDER BY metric_date) AS RowNumInSubset,
        ROW_NUMBER() OVER(PARTITION BY Location ORDER BY metric_date) AS RowNumInSet
FROM    T
ORDER BY metric_date

This produces the following:

metric_date         location    id      value   RowNumInSubset  RowNumInSet 
----------------------------------------------------------------------------
2020-02-07 13:00    ATL         A       34          1               1       
2020-02-07 13:05    ATL         B       12          2               2       
2020-02-07 13:10    ATL         B       2           3               3       
2020-02-07 13:15    ATL         A       15          4               4       
2020-02-07 13:20    ATL         A       0           1               5       
2020-02-07 13:25    ATL         A       0           2               6       
2020-02-07 13:30    ATL         A       12          5               7       
2020-02-07 13:35    ATL         B       12          6               8       
2020-02-07 13:40    ATL         A       23          7               9       
2020-02-07 13:45    ATL         B       3           8               10      
2020-02-07 13:50    ATL         A       0           3               11      
2020-02-07 13:55    ATL         A       0           4               12      

Then, by subtracting RowNumInSet from RowNumInSubset, you produce a constant value for each island:

metric_date         location    id      value   RowNumInSubset  RowNumInSet GroupingSet
------------------------------------------------------------------------------------
2020-02-07 13:00    ATL         A       34          1               1           0
2020-02-07 13:05    ATL         B       12          2               2           0
2020-02-07 13:10    ATL         B       2           3               3           0
2020-02-07 13:15    ATL         A       15          4               4           0
------------------------------------------------------------------------------------
2020-02-07 13:20    ATL         A       0           1               5           -4
2020-02-07 13:25    ATL         A       0           2               6           -4
------------------------------------------------------------------------------------
2020-02-07 13:30    ATL         A       12          5               7           -2
2020-02-07 13:35    ATL         B       12          6               8           -2
2020-02-07 13:40    ATL         A       23          7               9           -2
2020-02-07 13:45    ATL         B       3           8               10          -2
------------------------------------------------------------------------------------
2020-02-07 13:50    ATL         A       0           3               11          -8
2020-02-07 13:55    ATL         A       0           4               12          -8

Then finally, you can remove the rows where value = 0, as these are just break points:

metric_date         location    id      value   RowNumInSubset  RowNumInSet GroupingSet
------------------------------------------------------------------------------------
2020-02-07 13:00    ATL         A       34          1               1           0
2020-02-07 13:05    ATL         B       12          2               2           0
2020-02-07 13:10    ATL         B       2           3               3           0
2020-02-07 13:15    ATL         A       15          4               4           0
------------------------------------------------------------------------------------
2020-02-07 13:30    ATL         A       12          5               7           -2
2020-02-07 13:35    ATL         B       12          6               8           -2
2020-02-07 13:40    ATL         A       23          7               9           -2
2020-02-07 13:45    ATL         B       3           8               10          -2

Then you can perform your aggregate on each group.
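
For the first island (GroupingSet = 0), for example, that gives MAX(Value) = 34 and a 'B' sum of 12 + 2 = 14, so the success rate is 14 / 34 ≈ 0.4118; the second island (GroupingSet = -2) gives 23 and 12 + 3 = 15, so 15 / 23 ≈ 0.6522, matching the desired output.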

Example on DB<>Fiddle

GarethD

Based on a couple of assumptions where the question does not specify, this query produces your desired result exactly:

SELECT min(location) AS location
     , row_number() OVER (ORDER BY grp) AS section
     , max(value) AS max
     , sum(value) FILTER (WHERE id = 'B') AS sum
     , round(sum(value) FILTER (WHERE id = 'B')
           / max(value)::numeric, 4) AS success_rate
FROM (
   SELECT *, count(*) FILTER (WHERE value = 0) OVER (ORDER BY metric_date) AS grp
   FROM   tbl
   ) sub
WHERE  value <> 0
GROUP  BY grp;
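
The grp column in the subquery is simply a running count of the zero-value rows, so every row between two zeros gets the same number, and the zero rows themselves are removed by value <> 0. To see it in isolation (a sketch against the same assumed table tbl):

SELECT metric_date, id, value
     , count(*) FILTER (WHERE value = 0) OVER (ORDER BY metric_date) AS grp
FROM   tbl
ORDER  BY metric_date;

For the sample data this yields grp = 0 for the 13:00 - 13:15 rows and grp = 2 for the 13:30 - 13:45 rows (the zero rows get grp = 1, 2, 3, 4 and are filtered out), so row_number() OVER (ORDER BY grp) numbers the two surviving groups 1 and 2.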

db<>fiddle here

In particular, it does not group by location - which might otherwise make sense ...

Detailed explanations can be found in many related answers.

For maximum performance, consider a procedural solution in this particular case (even though set-based solutions are typically faster), as it can make do with a single sequential scan over the table.
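
A minimal sketch of such a procedural approach, not taken from the original answer, assuming a table named tbl with an integer value column and text location / id columns (the function and output column names here are illustrative):

CREATE OR REPLACE FUNCTION section_success_rates()
  RETURNS TABLE (location text, section int, max_value int, sum_b int, success_rate numeric)
  LANGUAGE plpgsql AS
$func$
DECLARE
   r record;
BEGIN
   section   := 0;
   max_value := 0;
   sum_b     := 0;

   FOR r IN SELECT * FROM tbl ORDER BY metric_date LOOP
      IF r.value = 0 THEN                      -- a zero row closes the current section
         IF max_value > 0 THEN
            success_rate := round(sum_b / max_value::numeric, 4);
            RETURN NEXT;                       -- emit the finished section
         END IF;
         max_value := 0;
         sum_b     := 0;
      ELSE
         IF max_value = 0 THEN                 -- first non-zero row opens a new section
            section  := section + 1;
            location := r.location;
         END IF;
         max_value := GREATEST(max_value, r.value);
         IF r.id = 'B' THEN
            sum_b := sum_b + r.value;          -- only 'B' rows count toward the sum
         END IF;
      END IF;
   END LOOP;

   IF max_value > 0 THEN                       -- emit a trailing section not closed by a zero row
      success_rate := round(sum_b / max_value::numeric, 4);
      RETURN NEXT;
   END IF;
END
$func$;

Called with SELECT * FROM section_success_rates();, it returns the same two sections for the sample data as the query above.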

Erwin Brandstetter