
I have a table in Postgres that looks like this:

# select * from p;
 id | value 
----+-------
  1 |   100
  2 |      
  3 |      
  4 |      
  5 |      
  6 |      
  7 |      
  8 |   200
  9 |          
(9 rows)

And I'd like to query to make it look like this:

# select * from p;
 id | value | new_value
----+-------+----------
  1 |   100 |    
  2 |       |    100
  3 |       |    100
  4 |       |    100
  5 |       |    100
  6 |       |    100
  7 |       |    100
  8 |   200 |    100
  9 |       |    200
(9 rows)

I can already do this with a subquery in the select, but in my real data I have 20k or more rows and it gets to be quite slow.

Is it possible to do this with a window function? I'd love to use lag(), but it doesn't seem to support the IGNORE NULLS option.

select id, value, lag(value, 1) over (order by id) as new_value from p;
 id | value | new_value
----+-------+-----------
  1 |   100 |      
  2 |       |       100
  3 |       |      
  4 |       |
  5 |       |
  6 |       |
  7 |       |
  8 |   200 |
  9 |       |       200
(9 rows)
adamlamar

8 Answers


I found this answer for SQL Server that also works in Postgres. Having never done it before, I thought the technique was quite clever. It builds a custom partition for the window function: a CASE expression inside a nested query increments a running sum whenever the value is non-null and leaves it alone otherwise, so every run of nulls ends up labeled with the same partition number as the previous non-null value. Here's the query:

SELECT
  id, value, value_partition, first_value(value) over (partition by value_partition order by id)
FROM (
  SELECT
    id,
    value,
    -- grows by 1 at every non-null value, so each run of nulls shares a
    -- partition number with the non-null value that precedes it
    sum(case when value is null then 0 else 1 end) over (order by id) as value_partition
  FROM p
) as q
ORDER BY id

And the results:

 id | value | value_partition | first_value
----+-------+-----------------+-------------
  1 |   100 |               1 |         100
  2 |       |               1 |         100
  3 |       |               1 |         100
  4 |       |               1 |         100
  5 |       |               1 |         100
  6 |       |               1 |         100
  7 |       |               1 |         100
  8 |   200 |               2 |         200
  9 |       |               2 |         200
(9 rows)
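
As an aside, count() only counts non-null values, so the partition number can also be computed without the CASE expression; a minimal sketch of the same technique:

SELECT
  id, value,
  first_value(value) OVER (PARTITION BY value_partition ORDER BY id) AS new_value
FROM (
  SELECT
    id,
    value,
    -- count(value) skips NULLs, so this matches the running CASE sum above
    count(value) OVER (ORDER BY id) AS value_partition
  FROM p
) AS q
ORDER BY id;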
adamlamar

You can create a custom aggregate function in Postgres. Here's an example for the int type:

-- Keep the incoming value if it is non-null, otherwise keep the running state
CREATE FUNCTION coalesce_agg_sfunc(state int, value int) RETURNS int AS
$$
    SELECT coalesce(value, state);
$$ LANGUAGE SQL;

CREATE AGGREGATE coalesce_agg(int) (
    SFUNC = coalesce_agg_sfunc,
    STYPE = int);

Then query as usual.

SELECT *, coalesce_agg(b) over w, sum(b) over w FROM y
  WINDOW w AS (ORDER BY a);

a b coalesce_agg sum 
- - ------------ ---
a 0            0   0
b ∅            0   0
c 2            2   2
d 3            3   5
e ∅            3   5
f 5            5  10
(6 rows)
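
Applied to the table from the question, the call would look like this (note that, like the accepted answer, it carries the current row's own non-null value forward rather than lagging it):

SELECT id, value,
       coalesce_agg(value) OVER (ORDER BY id) AS new_value
FROM p;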
Slobodan Pejic

Well, I can't guarantee this is the most efficient way, but it works:

SELECT id, value, (
    SELECT p2.value
    FROM p p2
    WHERE p2.value IS NOT NULL AND p2.id <= p1.id
    ORDER BY p2.id DESC
    LIMIT 1
) AS new_value
FROM p p1 ORDER BY id;

The following index can improve the sub-query for large datasets:

CREATE INDEX idx_p_idvalue_nonnull ON p (id, value) WHERE value IS NOT NULL;

Assuming value is sparse (i.e. there are a lot of nulls), it will run fine.
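
The same lookup can also be spelled as a LEFT JOIN LATERAL; a sketch with identical semantics, which can use the same partial index:

SELECT p1.id, p1.value, p2.value AS new_value
FROM p p1
LEFT JOIN LATERAL (
    -- latest non-null value at or before the current row
    SELECT px.value
    FROM p px
    WHERE px.value IS NOT NULL AND px.id <= p1.id
    ORDER BY px.id DESC
    LIMIT 1
) AS p2 ON true
ORDER BY p1.id;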

MatheusOl
  • Thanks! This is effectively what I already have in terms of a subquery. Definitely works. I didn't realize Postgres allowed you to create an index and combine it with a condition. Thats pretty awesome. – adamlamar Sep 25 '13 at 18:04

Another possibility is to build a sum: each non-null amount starts a new partition, and since every partition then contains exactly one non-null value, the SUM over that partition returns it:

WITH CTE_Data(Company, ValueDate, Amount)
AS(
    SELECT 'Company', '2021-05-01', 1000    UNION
    SELECT 'Company', '2021-05-02', 1250    UNION
    SELECT 'Company', '2021-05-03', NULL    UNION
    SELECT 'Company', '2021-05-04', NULL    UNION
    SELECT 'Company', '2021-05-05', 7500    UNION
    SELECT 'Company', '2021-05-06', NULL    UNION
    SELECT 'Company', '2021-05-07', 3200    UNION
    SELECT 'Company', '2021-05-08', 3400    UNION
    SELECT 'Company', '2021-05-09', NULL    UNION
    SELECT 'Company', '2021-05-10', 7800
)

SELECT
     d.Company
    ,d.ValueDate
    ,d.Amount
    ,d.Partition
    ,SUM(d.Amount) OVER(PARTITION BY d.Company, d.Partition) AS Missing
FROM(
    SELECT
         d.Company
        ,d.ValueDate
        ,d.Amount
        ,SUM(CASE WHEN d.Amount IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY d.Company ORDER BY d.ValueDate) AS Partition
    FROM CTE_Data AS d
) AS d
Bettelbursche

It is possible to emulate LAG(...)/LEAD(...) IGNORE NULLS with ARRAY_AGG by accessing the last non-null value collected so far:

WITH cte AS (
  SELECT *,
    ARRAY_REMOVE((ARRAY_AGG(value) OVER(/*PARTITION BY group*/ ORDER BY id)), NULL) 
       AS value_arr
  FROM tab
)
SELECT id, value,
    value_arr[array_upper(value_arr, 1)] AS new_value,
    value_arr -- debug
FROM cte
ORDER BY id;

For input data:

CREATE TABLE tab(id INT, value INT);
INSERT INTO tab(id, value)
VALUES (1,100),(2,NULL),(3,NULL),(4,NULL),(5,NULL),
       (6, NULL),(7,NULL),(8,200),(9,NULL);

(screenshot of the result: new_value is 100 for ids 1-7 and 200 for ids 8-9; value_arr accumulates {100}, then {100,200})


Condensed version:

SELECT *,
    (ARRAY_REMOVE((ARRAY_AGG(value) OVER( ORDER BY id)), NULL))
    [ARRAY_UPPER(ARRAY_REMOVE((ARRAY_AGG(value) OVER( ORDER BY id)), NULL), 1) ]
       AS new_value
FROM tab
ORDER BY id

The code duplication does not look nice; it would be much easier if there were a way to access the last element with some_array[-1].

db<>fiddle demo


Using Lukas's suggestion:

SELECT *,
 to_json(ARRAY_REMOVE((ARRAY_AGG(value) OVER(/*PARTITION BY any_group*/ ORDER BY id))
         ,NULL))->-1 AS value_arr
FROM tab
ORDER BY id;
Lukasz Szozda
    You could use `to_json(array_remove(...))->-1` to avoid duplication, at the price of copying around data. – Lukas Eder Mar 06 '23 at 16:25

In my case I needed to maintain a running balance across non-trading days: usually just weekends, occasionally a three-day weekend when a holiday adds an extra non-trading day.

If the number of empty days is pretty low, you can solve this problem via a CASE statement and a series of LAG window functions:

SELECT
    CASE
        WHEN balance IS NULL THEN
            -- A non-null balance must be found within the first 3 preceding rows
            CASE
                WHEN LAG(balance, 1) OVER () IS NOT NULL
                  THEN LAG(balance, 1) OVER ()
                WHEN LAG(balance, 2) OVER () IS NOT NULL
                  THEN LAG(balance, 2) OVER ()
                WHEN LAG(balance, 3) OVER () IS NOT NULL
                  THEN LAG(balance, 3) OVER ()
            END
        ELSE balance
    END AS balance
FROM daily_data;

Not practical for an unbounded gap, but a nice solution to be aware of for minor gaps: simply add more "WHEN LAG(balance, x) ..." clauses if necessary. I was fortunate that I only needed to do this with one column, and this solution unblocked me. A more compact COALESCE spelling is sketched below.
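
The same bounded lookup written with COALESCE, which returns its first non-null argument; a sketch, where trading_date is a stand-in for whatever column defines the row order (the OVER () clauses above leave it implicit):

SELECT COALESCE(
           balance,
           LAG(balance, 1) OVER w,
           LAG(balance, 2) OVER w,
           LAG(balance, 3) OVER w
       ) AS balance
FROM daily_data
WINDOW w AS (ORDER BY trading_date);  -- hypothetical ordering column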

AuthorOfTheSurf
with p (id, value) as (
    values (1, 100),
           (2, null),
           (3, null),
           (4, null),
           (5, null),
           (6, null),
           (7, null),
           (8, 200),
           (9, null))
select *
     , (json_agg(value) filter (where value notnull) over (order by id) ->> -1)::int
from p
;

This uses an aggregate function with the FILTER option: json_agg collects the non-null values in id order, and ->> -1 takes the last element as text, which is then cast back to int.

zoolus

You can use a last() aggregate with FILTER to achieve what you need (at least in PG 9.4). Note that FILTER is only allowed on aggregate functions, not on true window functions such as LAST_VALUE, so last() must be defined as an aggregate, as sketched below.

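last() is not built into Postgres, so it has to be created first; a minimal sketch, modeled on the first/last aggregates from the PostgreSQL wiki:

-- The transition function always keeps the most recent input,
-- so the final state is the last value seen in the frame
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $2 $$;

CREATE AGGREGATE last(anyelement) (
    SFUNC = last_agg,
    STYPE = anyelement
);
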
WITH base AS (
SELECT 1 AS id , 100 AS val
UNION ALL
SELECT 2 AS id , null AS val
UNION ALL
SELECT 3 AS id , null AS val
UNION ALL
SELECT 4 AS id , null AS val
UNION ALL
SELECT 5 AS id , 200 AS val
UNION ALL
SELECT 6 AS id , null AS val
UNION ALL
SELECT 7 AS id , null AS val
)
SELECT id, val,
       -- the frame stops at the previous row, so the current row's own value is
       -- ignored, matching the lag-style output from the question
       last(val) FILTER (WHERE val IS NOT NULL)
           OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS new_val
FROM base;