
I have a table in Postgres that looks like this:

# select * from p;
 id | value 
----+-------
  1 |   100
  2 |      
  3 |      
  4 |      
  5 |      
  6 |      
  7 |      
  8 |   200
  9 |          
(9 rows)

And I'd like to query to make it look like this:

# select * from p;
 id | value | new_value
----+-------+----------
  1 |   100 |    
  2 |       |    100
  3 |       |    100
  4 |       |    100
  5 |       |    100
  6 |       |    100
  7 |       |    100
  8 |   200 |    100
  9 |       |    200
(9 rows)

I can already do this with a subquery in the select, but in my real data I have 20k or more rows and it gets to be quite slow.

Is it possible to do this with a window function? I'd love to use lag(), but it doesn't seem to support the IGNORE NULLS option.

select id, value, lag(value, 1) over (order by id) as new_value from p;
 id | value | new_value
----+-------+-----------
  1 |   100 |      
  2 |       |       100
  3 |       |      
  4 |       |
  5 |       |
  6 |       |
  7 |       |
  8 |   200 |
  9 |       |       200
(9 rows)
adamlamar

8 Answers


I found this answer for SQL Server that also works in Postgres. Having never done it before, I thought the technique was quite clever. It builds a custom partition for the window function: a CASE expression inside a nested query increments a running sum whenever the value is non-null and leaves it alone otherwise, so every run of nulls ends up labeled with the same partition number as the previous non-null value. Here's the query:

SELECT
  id, value, value_partition, first_value(value) over (partition by value_partition order by id)
FROM (
  SELECT
    id,
    value,
    -- grows by 1 at every non-null value, so each run of nulls shares a
    -- partition number with the non-null value that precedes it
    sum(case when value is null then 0 else 1 end) over (order by id) as value_partition
  FROM p
) as q
ORDER BY id

And the results:

 id | value | value_partition | first_value
----+-------+-----------------+-------------
  1 |   100 |               1 |         100
  2 |       |               1 |         100
  3 |       |               1 |         100
  4 |       |               1 |         100
  5 |       |               1 |         100
  6 |       |               1 |         100
  7 |       |               1 |         100
  8 |   200 |               2 |         200
  9 |       |               2 |         200
(9 rows)
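
As an aside, count() only counts non-null values, so the partition number can also be computed without the CASE expression; a minimal sketch of the same technique:

SELECT
  id, value,
  first_value(value) OVER (PARTITION BY value_partition ORDER BY id) AS new_value
FROM (
  SELECT
    id,
    value,
    -- count(value) skips NULLs, so this matches the running CASE sum above
    count(value) OVER (ORDER BY id) AS value_partition
  FROM p
) AS q
ORDER BY id;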
adamlamar

You can create a custom aggregate function in Postgres. Here's an example for the int type:

-- Keep the incoming value if it is non-null, otherwise keep the running state
CREATE FUNCTION coalesce_agg_sfunc(state int, value int) RETURNS int AS
$$
    SELECT coalesce(value, state);
$$ LANGUAGE SQL;

CREATE AGGREGATE coalesce_agg(int) (
    SFUNC = coalesce_agg_sfunc,
    STYPE = int);

Then query as usual.

SELECT *, coalesce_agg(b) over w, sum(b) over w FROM y
  WINDOW w AS (ORDER BY a);

a b coalesce_agg sum 
- - ------------ ---
a 0            0   0
b ∅            0   0
c 2            2   2
d 3            3   5
e ∅            3   5
f 5            5  10
(6 rows)
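
Applied to the table from the question, the call would look like this (note that, like the accepted answer, it carries the current row's own non-null value forward rather than lagging it):

SELECT id, value,
       coalesce_agg(value) OVER (ORDER BY id) AS new_value
FROM p;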
Slobodan Pejic

Well, I can't guarantee this is the most efficient way, but it works:

SELECT id, value, (
    SELECT p2.value
    FROM p p2
    WHERE p2.value IS NOT NULL AND p2.id <= p1.id
    ORDER BY p2.id DESC
    LIMIT 1
) AS new_value
FROM p p1 ORDER BY id;

The following index can improve the sub-query for large datasets:

CREATE INDEX idx_p_idvalue_nonnull ON p (id, value) WHERE value IS NOT NULL;

Assuming value is sparse (i.e. there are a lot of nulls), it will run fine.
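
The same lookup can also be spelled as a LEFT JOIN LATERAL; a sketch with identical semantics, which can use the same partial index:

SELECT p1.id, p1.value, p2.value AS new_value
FROM p p1
LEFT JOIN LATERAL (
    -- latest non-null value at or before the current row
    SELECT px.value
    FROM p px
    WHERE px.value IS NOT NULL AND px.id <= p1.id
    ORDER BY px.id DESC
    LIMIT 1
) AS p2 ON true
ORDER BY p1.id;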

MatheusOl
  • Thanks! This is effectively what I already have in terms of a subquery. Definitely works. I didn't realize Postgres allowed you to create an index and combine it with a condition. Thats pretty awesome. – adamlamar Sep 25 '13 at 18:04

Another possibility is to build a sum: each non-null amount starts a new partition, and since every partition then contains exactly one non-null value, the SUM over that partition returns it:

WITH CTE_Data(Company, ValueDate, Amount)
AS(
    SELECT 'Company', '2021-05-01', 1000    UNION
    SELECT 'Company', '2021-05-02', 1250    UNION
    SELECT 'Company', '2021-05-03', NULL    UNION
    SELECT 'Company', '2021-05-04', NULL    UNION
    SELECT 'Company', '2021-05-05', 7500    UNION
    SELECT 'Company', '2021-05-06', NULL    UNION
    SELECT 'Company', '2021-05-07', 3200    UNION
    SELECT 'Company', '2021-05-08', 3400    UNION
    SELECT 'Company', '2021-05-09', NULL    UNION
    SELECT 'Company', '2021-05-10', 7800
)

SELECT
     d.Company
    ,d.ValueDate
    ,d.Amount
    ,d.Partition
    ,SUM(d.Amount) OVER(PARTITION BY d.Company, d.Partition) AS Missing
FROM(
    SELECT
         d.Company
        ,d.ValueDate
        ,d.Amount
        ,SUM(CASE WHEN d.Amount IS NULL THEN 0 ELSE 1 END) OVER (PARTITION BY d.Company ORDER BY d.ValueDate) AS Partition
    FROM CTE_Data AS d
) AS d
Bettelbursche

It is possible to emulate LAG(...)/LEAD(...) IGNORE NULLS with ARRAY_AGG by accessing the last non-null value collected so far:

WITH cte AS (
  SELECT *,
    ARRAY_REMOVE((ARRAY_AGG(value) OVER(/*PARTITION BY group*/ ORDER BY id)), NULL) 
       AS value_arr
  FROM tab
)
SELECT id, value,
    value_arr[array_upper(value_arr, 1)] AS new_value,
    value_arr -- debug
FROM cte
ORDER BY id;

For input data:

CREATE TABLE tab(id INT, value INT);
INSERT INTO tab(id, value)
VALUES (1,100),(2,NULL),(3,NULL),(4,NULL),(5,NULL),
       (6, NULL),(7,NULL),(8,200),(9,NULL);

(screenshot of the result: new_value is 100 for ids 1-7 and 200 for ids 8-9; value_arr accumulates {100}, then {100,200})


Condensed version:

SELECT *,
    (ARRAY_REMOVE((ARRAY_AGG(value) OVER( ORDER BY id)), NULL))
    [ARRAY_UPPER(ARRAY_REMOVE((ARRAY_AGG(value) OVER( ORDER BY id)), NULL), 1) ]
       AS new_value
FROM tab
ORDER BY id

The code duplication does not look nice; it would be much easier if there were a way to access the last element with some_array[-1].

db<>fiddle demo


Using Lukas's suggestion:

SELECT *,
 to_json(ARRAY_REMOVE((ARRAY_AGG(value) OVER(/*PARTITION BY any_group*/ ORDER BY id))
         ,NULL))->-1 AS value_arr
FROM tab
ORDER BY id;
Lukasz Szozda
    You could use `to_json(array_remove(...))->-1` to avoid duplication, at the price of copying around data. – Lukas Eder Mar 06 '23 at 16:25

In my case I needed to maintain a running balance across non-trading days: usually just weekends, occasionally a three-day weekend when a holiday adds an extra non-trading day.

If the number of empty days is pretty low, you can solve this problem via a CASE statement and a series of LAG window functions:

SELECT
    CASE
        WHEN balance IS NULL THEN
            -- A non-null balance must be found within the first 3 preceding rows
            CASE
                WHEN LAG(balance, 1) OVER () IS NOT NULL
                  THEN LAG(balance, 1) OVER ()
                WHEN LAG(balance, 2) OVER () IS NOT NULL
                  THEN LAG(balance, 2) OVER ()
                WHEN LAG(balance, 3) OVER () IS NOT NULL
                  THEN LAG(balance, 3) OVER ()
            END
        ELSE balance
    END AS balance
FROM daily_data;

Not practical for an unbounded gap, but a nice solution to be aware of for minor gaps: simply add more "WHEN LAG(balance, x) ..." clauses if necessary. I was fortunate that I only needed to do this with one column, and this solution unblocked me. A more compact COALESCE spelling is sketched below.
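
The same bounded lookup written with COALESCE, which returns its first non-null argument; a sketch, where trading_date is a stand-in for whatever column defines the row order (the OVER () clauses above leave it implicit):

SELECT COALESCE(
           balance,
           LAG(balance, 1) OVER w,
           LAG(balance, 2) OVER w,
           LAG(balance, 3) OVER w
       ) AS balance
FROM daily_data
WINDOW w AS (ORDER BY trading_date);  -- hypothetical ordering column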

AuthorOfTheSurf
with p (id, value) as (
    values (1, 100),
           (2, null),
           (3, null),
           (4, null),
           (5, null),
           (6, null),
           (7, null),
           (8, 200),
           (9, null))
select *
     , (json_agg(value) filter (where value notnull) over (order by id) ->> -1)::int
from p
;

This uses an aggregate function with the FILTER option: json_agg collects the non-null values in id order, and ->> -1 takes the last element as text, which is then cast back to int.

zoolus

You can use a last() aggregate with FILTER to achieve what you need (at least in PG 9.4). Note that FILTER is only allowed on aggregate functions, not on true window functions such as LAST_VALUE, so last() must be defined as an aggregate, as sketched below.

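last() is not built into Postgres, so it has to be created first; a minimal sketch, modeled on the first/last aggregates from the PostgreSQL wiki:

-- The transition function always keeps the most recent input,
-- so the final state is the last value seen in the frame
CREATE OR REPLACE FUNCTION last_agg(anyelement, anyelement)
RETURNS anyelement LANGUAGE sql IMMUTABLE STRICT AS
$$ SELECT $2 $$;

CREATE AGGREGATE last(anyelement) (
    SFUNC = last_agg,
    STYPE = anyelement
);
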
WITH base AS (
SELECT 1 AS id , 100 AS val
UNION ALL
SELECT 2 AS id , null AS val
UNION ALL
SELECT 3 AS id , null AS val
UNION ALL
SELECT 4 AS id , null AS val
UNION ALL
SELECT 5 AS id , 200 AS val
UNION ALL
SELECT 6 AS id , null AS val
UNION ALL
SELECT 7 AS id , null AS val
)
SELECT id, val,
       -- the frame stops at the previous row, so the current row's own value is
       -- ignored, matching the lag-style output from the question
       last(val) FILTER (WHERE val IS NOT NULL)
           OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS new_val
FROM base;