
I am on Redshift. Given the following data:

CREATE TABLE test (
id INT,
val1 INT,
val2 INT
);

INSERT INTO test VALUES
(1, 0,  NULL),
(2, 0,  NULL),
(3, 13, 1),
(4, 0,  NULL),
(5, 0,  NULL),
(6, 0,  NULL),
(7, 0,  NULL),
(8, 21, 2),
(9, 0,  NULL),
(10, 143, 3)
;

I want to fill the missing val2 values with the first following non-null value, e.g.:

INSERT INTO results VALUES
(1, 0,  1),
(2, 0,  1),
(3, 13, 1),
(4, 0,  2),
(5, 0,  2),
(6, 0,  2),
(7, 0,  2),
(8, 21, 2),
(9, 0,  3),
(10, 143, 3)
;

What is the best way to accomplish this in Redshift/Postgres 8.0.2?

Roberto
  • Thanks for specifically mentioning that you're on Redshift, not just "PostgreSQL". The usual strategy would be a join over `generate_series`, but I don't know if you have that on Redshift, and Amazon don't offer the SQLFiddle guys accounts so I can't really test. – Craig Ringer Jul 02 '14 at 01:12
  • Unfortunately no generate_series – Roberto Jul 02 '14 at 03:55
  • ... and the usual workarounds are with procedural languages (not on Redshift), window functions (not on redshift), etc. – Craig Ringer Jul 02 '14 at 04:20
  • 1
    @CraigRinger window functions are supported in redshift. At any rate, I would love to see a solution with generate_series(). – Roberto Jul 02 '14 at 06:13

4 Answers


One way I've been able to solve it (taking advantage of the fact that the non-null val2 values are sequential) is below. The performance is awful, though, so any better solutions would be more than welcome.

SELECT
  t1.id
  , t1.val1
  , COALESCE(t1.val2, MIN(t2.val2)) as val2
FROM test t2 LEFT JOIN test t1 ON t2.id >= t1.id
WHERE t2.val2 IS NOT NULL
AND t1.val1 IS NOT NULL
GROUP BY 1, 2, t1.val2
ORDER BY t1.id
;

SQLFiddle link

Roberto

I don't see how you'll get much better. You mention the val2 values being sequential, but all that's needed for your solution is that they are increasing. The COALESCE isn't really needed. I find this version easier to read... and it's possible that it will be incrementally faster since there's no need to group on val2. But it's not a fundamental change.

SELECT
  t1.id
  , t1.val1
  , min(t2.val2)
FROM test t1
LEFT OUTER JOIN test t2 on (t1.id <= t2.id and t2.val2 is not null)
GROUP BY t1.id, t1.val1
ORDER BY t1.id
;
mdahlman

This works for any values in val2. They don't need to be sequential and NULL values can appear anywhere (including the last row).

SELECT t1.id, t1.val1, COALESCE(t1.val2, t2.val2) as val2
FROM   test t1
LEFT   JOIN test t2
          ON  t2.id > t1.id
          AND t1.val2 IS NULL
          AND t2.val2 IS NOT NULL
          AND NOT EXISTS (
             SELECT 1
             FROM   test t3
             WHERE  t3.id > t1.id
             AND    t3.id < t2.id
             AND    t3.val2 IS NOT NULL
             )
ORDER  BY t1.id;

It also removes a corner-case bug in your query: the WHERE clause would remove trailing rows with val2 IS NULL. You would have to pull that condition up into the JOIN clause. Details:
Query with LEFT JOIN not returning rows for count of 0
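Applied to the query from the question, the fix amounts to moving the predicate out of WHERE and into the ON clause. This is an untested sketch of that change (aliases normalized so t1 is the preserved side):

```sql
-- Filter moved from WHERE into ON, so rows with no later
-- non-NULL val2 survive (with val2 NULL) instead of vanishing:
SELECT
  t1.id
, t1.val1
, COALESCE(t1.val2, MIN(t2.val2)) AS val2
FROM test t1
LEFT JOIN test t2
       ON t2.id >= t1.id
      AND t2.val2 IS NOT NULL
GROUP BY t1.id, t1.val1, t1.val2
ORDER BY t1.id;
```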

Not sure whether it's faster than a CROSS JOIN / min() in Redshift.

Erwin Brandstetter

You can avoid joins entirely and use window functions instead (subquery aliases added, since Postgres requires them):

SELECT id, val1, val2,
       COALESCE(val2, LEAD(val2, dist::int) OVER (ORDER BY id)) AS notNullVal2
FROM (
  SELECT id, val1, val2, c,
         ROW_NUMBER() OVER (PARTITION BY c ORDER BY id DESC) AS dist
  FROM (
    SELECT id, val1, val2,
           COUNT(val2) OVER (ORDER BY id ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS c
    FROM test
  ) counted
) ranked
ORDER BY id;
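As an aside: if your Redshift release supports the IGNORE NULLS option of LEAD (stock PostgreSQL does not), the whole fill collapses to a single expression with no nesting. An untested sketch:

```sql
-- IGNORE NULLS makes LEAD skip straight to the next non-NULL val2:
SELECT id, val1,
       COALESCE(val2, LEAD(val2) IGNORE NULLS OVER (ORDER BY id)) AS val2
FROM test
ORDER BY id;
```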
rom_j