8

I want to "left join" a table so that a value is joined not just to a matching row, but also to any subsequent non-matching rows, up to the next matching row. To put it another way, I want to fill in nulls with the previous non-null value.

Sample data and desired result:

Table x:

 id 
----
  1
  2
  3
  4
  5

Table y:

 id | val 
----+-----
  1 | a
  4 | b

Result of `select x.id, y.val from x left join y on x.id=y.id order by x.id;`:

 id | val 
----+-----
  1 | a
  2 | 
  3 | 
  4 | b
  5 | 

Desired result:

 id | val 
----+-----
  1 | a
  2 | a
  3 | a
  4 | b
  5 | b
Erwin Brandstetter
jl6

7 Answers

6

Indices

Create indices on x.id and y.id - which you probably already have if those are your primary keys.
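
For reference, a minimal schema matching the sample data (a sketch, assuming id is the primary key of both tables, as the question implies):

CREATE TABLE x (id int PRIMARY KEY);
CREATE TABLE y (id int PRIMARY KEY, val text);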
A multi-column index may help, too, especially with index-only scans in Postgres 9.2+:

CREATE INDEX y_mult_idx ON y (id DESC, val)

However, in my tests, this index was not used at first. I had to add the (otherwise pointless) val to ORDER BY to convince the query planner that the sort order matches. See query 3.

The index makes little difference in this synthetic setup. But for tables with more columns, retrieving val from the table becomes increasingly expensive, making the "covering" index more attractive.

Queries

1) Simple

SELECT DISTINCT ON (x.id)
       x.id, y.val
FROM   x
JOIN   y ON y.id <= x.id
ORDER  BY x.id, y.id DESC;

SQL Fiddle.

More explanation of the technique with DISTINCT ON in this related answer.
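
To see why this works, consider what the join produces for the sample data before DISTINCT ON picks one row per x.id (a worked illustration):

 x.id | y.id | val
------+------+-----
    1 |    1 | a
    2 |    1 | a
    3 |    1 | a
    4 |    1 | a
    4 |    4 | b
    5 |    1 | a
    5 |    4 | b

With ORDER BY x.id, y.id DESC, DISTINCT ON keeps the row with the greatest y.id for each x.id.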

I ran some tests because I had my suspicions that the first query wouldn't scale well. It's fast with a small table, but no good with bigger tables. Postgres doesn't optimize the plan and starts with a (limited) cross join, with a cost of O(N²).

2) Fast

This query is still rather simple and scales excellently:

SELECT x.id, y.val
FROM   x
JOIN  (SELECT *, lead(id, 1, 2147483647) OVER (ORDER BY id) AS next_id FROM y) y
       ON  x.id >= y.id
       AND x.id <  y.next_id
ORDER  BY 1;

The window function lead() is instrumental. I make use of the option to provide a default to cover the corner case of the last row: 2147483647 is the largest possible integer value. Adapt this to your data type.
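
For the sample data, the subquery on y yields (a worked illustration):

 id | val | next_id
----+-----+------------
  1 | a   |          4
  4 | b   | 2147483647

Each x.id then matches exactly one half-open range [id, next_id).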

3) Very simple and almost as fast

SELECT x.id
     ,(SELECT val FROM y WHERE id <= x.id ORDER BY id DESC, val LIMIT 1) AS val
FROM   x;

Normally, correlated subqueries tend to be slow. But this one can just pick a value from the (covering) index and is otherwise so simple that it can compete.

The additional ORDER BY item val seems pointless. But adding it convinces the query planner that it is OK to use the multi-column index y_mult_idx from above, because the sort order matches. Note the

Index Only Scan using y_mult_idx ..

in the EXPLAIN output.
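
To check this locally, something along these lines should do (a sketch; the exact plan depends on your data, version, and settings):

EXPLAIN (ANALYZE, BUFFERS)
SELECT x.id
     ,(SELECT val FROM y WHERE id <= x.id ORDER BY id DESC, val LIMIT 1) AS val
FROM   x;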

Test case

After a lively debate and multiple updates I collected all queries posted so far and made a test case for a quick overview. I only use 1000 rows so SQLfiddle does not time out with the slower queries. But the top 4 (Erwin 2, Clodoaldo, a_horse, Erwin 3) scale linearly in all my local tests. Updated once more to include my latest addition, improve the format, and order by performance:

Big SQL Fiddle comparing performance.

Erwin Brandstetter
  • Your fiddle does not appear to have a primary key on `y.id` – wildplasser Apr 07 '13 at 14:50
  • @wildplasser: no, for the original small test case I just wanted to demonstrate it works. I later ran extensive tests with 10k and 100k rows (including primary keys), before I posted the update. My second query came in fastest, closely followed by a_horse's and Clodoaldo's. Just tested yours, too, it has a similar problem as my first version: doesn't scale well. – Erwin Brandstetter Apr 07 '13 at 14:54
  • Ok, sorry. Thanks for testing. I also ran it with 10K rows and it did not feel good (and on inspection generated an ugly plan, even after VACUUM ANALYZE). Strange ... – wildplasser Apr 07 '13 at 14:57
  • @a_horse_with_no_name: Previous comment had numbers for 10k rows, I repost with 100k. All three queries scale linearly. I ran a test case on Postgres 9.1.9 with 100k rows and my 2nd version (not the 1st!) and it took 931 ms, while yours and Clodoaldo's took 1632 ms (exactly the same). [depesz for 10k rows](http://explain.depesz.com/s/Ymou). [depesz for 100k rows](http://explain.depesz.com/s/otIP). No null values, btw - 20k rows in y, so 5 `x` for 1 `y`. – Erwin Brandstetter Apr 07 '13 at 15:26
  • @a_horse_with_no_name: I tried to reproduce the performance on SQLfiddle 9.2.4 and failed, too. I wonder why ... BTW, why the null values? There are no null values for `val` in the question. Can it be that you took the result of his query as test data? – Erwin Brandstetter Apr 07 '13 at 15:32
  • Hmm, even with the "correct" test-data, I can't reproduce your results: http://sqlfiddle.com/#!12/235fd/2 There must be something wrong with my test setup I guess –  Apr 07 '13 at 15:55
  • @a_horse_with_no_name: Your test case seems just fine now. I have run several tests by now and find that *performance tests with SQLfiddle are just not reliable*. I took your test setting and the code from the big comparison I added to my answer and could reproduce my results locally. Also consider the fiddle I posted. With 1000 rows it seems to work on SQLfiddle, too. At least in my last test. – Erwin Brandstetter Apr 07 '13 at 16:11
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/27732/discussion-between-erwin-brandstetter-and-a-horse-with-no-name) – Erwin Brandstetter Apr 07 '13 at 18:04
  • @Erwin, thank you for your effort here! I will test your method in my application and report my findings. – jl6 Apr 07 '13 at 22:01
  • @Erwin: Sorry for the suspense, but this is for a weekend project and I'm not realistically going to have time to run tests until Sunday. – jl6 Apr 11 '13 at 10:35
  • 1
    Well, I got curious and did a preliminary implementation. Erwin2 is 2-3 times faster than Clodoaldo on 3000 rows of data - and more importantly, a where filter appears to be pushed down into the base tables (my actual queries involve 15 or more tables joined like `y`, so it is important that the query plan filters each at the earliest possible stage... but that's another topic for another day). Thanks once again for everyone's help, but the accepted answer must now go to Erwin! – jl6 Apr 11 '13 at 21:03
4
select id,
       first_value(t.val) over (partition by group_flag order by t.id) as val
from (
  select x.id, 
         y.val, 
         sum(case
            when y.val is null then 0
            else 1
          end) over (order by x.id) as group_flag
  from x 
    left join y on x.id=y.id
) t    
order by id;

SQLFiddle example: http://sqlfiddle.com/#!12/38903/1
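
For the sample data, the inner query assigns these group numbers (a worked illustration; the running sum only increments on rows where y.val is not null):

 id | val | group_flag
----+-----+------------
  1 | a   |          1
  2 |     |          1
  3 |     |          1
  4 | b   |          2
  5 |     |          2

first_value() then copies the first val of each group to every row in the group.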

4

SQL Fiddle

select
    id, 
    first_value(val) over(partition by g order by id) val
from (
    select
        x.id, val, count(val) over(order by x.id) g
    from
        x
        left join
        y on x.id=y.id
) s
order by id
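
For the sample data, count(val) over(order by x.id) yields 1, 1, 1, 2, 2 - the same grouping as the sum(case ...) construct above, since count() skips nulls.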
Clodoaldo Neto
  • Nice trick with the `count()` that's more elegant than my `case` construct. But I see we both had the same idea. –  Apr 07 '13 at 11:13
  • Looks ideal. Do I understand correctly that the count() function is evaluated for each row, performing a running count of non-missing `val`ues over the row's frame (which is all rows from the first to the current)? – jl6 Apr 07 '13 at 11:20
  • @jl6 Yes it is a running count and as it does not count nulls it is perfect for making that grouping. – Clodoaldo Neto Apr 07 '13 at 11:44
  • @a_horse_with_no_name: I'll be the party-crasher: two window functions and a subquery may be overkill - brought a little something to this party myself. :) – Erwin Brandstetter Apr 07 '13 at 13:06
2
SELECT
  m.id,
  y.val
FROM (
  SELECT
    x.id,
    MAX(y.id) id_y
  FROM
    x INNER JOIN y ON x.id >= y.id
  GROUP BY
    x.id
  ) m INNER JOIN y ON m.id_y = y.id
ORDER BY
  m.id

Please see fiddle here.
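
For the sample data, the inner query m evaluates to (a worked illustration):

 id | id_y
----+------
  1 |    1
  2 |    1
  3 |    1
  4 |    4
  5 |    4

The outer join then fetches val for each id_y.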

fthiella
2

I like to express the aggregate functions MIN(), MAX(), or closer_to() in terms of (NOT) EXISTS.

SELECT x.id, y.val
FROM   x
JOIN   y ON y.id <= x.id
WHERE NOT EXISTS (SELECT *
        FROM y y2
        WHERE y2.id <= x.id -- same condition as main query
        AND y2.id > y.id    -- but closer to x.id
        )
        ;
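
For the sample data, the NOT EXISTS filter keeps, for each x.id, only the row with the greatest y.id <= x.id (a worked illustration):

 x.id | y.id | val
------+------+-----
    1 |    1 | a
    2 |    1 | a
    3 |    1 | a
    4 |    4 | b
    5 |    4 | b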

My gut feeling is that this will generate exactly the same query plan as Erwin's answer.

wildplasser
  • This variant of your approach should perform better: http://sqlfiddle.com/#!12/38903/8 – Erwin Brandstetter Apr 07 '13 at 15:04
  • But it is *ugly* ;-) Note: as we speak, I am upgrading to 9.2.4 to see any differences. (there are rumours that 9.2 is slower; I don't believe them, at least not for real-world queries ...) – wildplasser Apr 07 '13 at 15:12
  • Didn't say it was a beauty. :) Just faster. If you are looking for elegance, behold what I posted in my answer. ;) – Erwin Brandstetter Apr 07 '13 at 15:16
  • http://www.cs.utexas.edu/~EWD/symposiumProgram.pdf (large PDF alert) "Elegance is not a dispensable luxury but a factor that decides between success and failure." ;-) – wildplasser Apr 07 '13 at 15:25
1

Use COALESCE and a subquery for the logic.

The subquery retrieves the most recent val value.

Try this:

SELECT x1.id,
       coalesce(y1.val, (SELECT val
                       FROM   y
                       WHERE  id = (SELECT Max(x2.id)
                                    FROM   x AS x2
                                           JOIN y AS y2
                                             ON x2.id = y2.id
                                    WHERE  x2.id < x1.id)))
FROM   x AS x1
       LEFT JOIN y AS y1
              ON x1.id = y1.id
ORDER  BY x1.id;  

sqlfiddle : http://www.sqlfiddle.com/#!12/42526/1
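
Tracing one row as a worked example: for x1.id = 3 the left join finds no y row, so COALESCE falls back to the subquery. The innermost query returns Max(x2.id) = 1 (the greatest id below 3 that has a match in y), and the outer subquery then returns val = 'a'.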

Iswanto San
0

I am not sure how you would achieve this using a single query. Logic similar to the following stored procedure would return the desired result:

create PROCEDURE GetData
AS
BEGIN
    Declare @resultTable TABLE(
    id int,
    value varchar(10))

    Declare @id int
    Declare @previousValue varchar(10)
    Declare @currentValue varchar(10)

    DECLARE x_cursor CURSOR
    FOR SELECT id FROM x order by id

    OPEN x_cursor
    FETCH NEXT FROM x_cursor into @id;
    WHILE (@@FETCH_STATUS = 0)
    BEGIN
        -- If Y has no row for @id, the SELECT assigns nothing and
        -- @currentValue keeps its previous value - that is the carry-forward.
        -- isnull() covers rows that exist in Y but hold a NULL val.
        select @currentValue = isnull(val,@previousValue)  from Y where id = @id
        insert into @resultTable values(@id,@currentValue)
        set @previousValue = @currentValue
        FETCH NEXT FROM x_cursor into @id;
    END

    Close x_cursor
    Deallocate x_cursor

    select * from @resultTable
END
GO
  • Nice try but the question is PostgreSQL only and your procedure is SQL Server only – Clodoaldo Neto Apr 07 '13 at 11:46
  • And a stored procedure is absolutely not necessary. –  Apr 07 '13 at 12:25
  • @user2254386 Here's the SQL Server version, albeit it's only on a recent version (2012) that SQL Server has window functions like first_value. Won't work on SQL Server 2008: http://www.sqlfiddle.com/#!6/7e45d/1 – Michael Buen Apr 07 '13 at 12:42