Joining tables if the reference exists

Question

I got a PostgreSQL database with 4 tables:

Table A

---------------------------
| ID | B_ID | C_ID | D_ID |
---------------------------
| 1  |  1   | NULL | NULL |
---------------------------
| 2  | NULL |   1  | NULL |
---------------------------
| 3  |  2   |   2  |   1  |
---------------------------
| 4  | NULL | NULL |   2  |
---------------------------

Table B

-------------
| ID | DATA |
-------------
| 1  | 123  |
-------------
| 2  | 456  |
-------------

Table C

-------------
| ID | DATA |
-------------
| 1  | 789  |
-------------
| 2  | 102  |
-------------

Table D

-------------
| ID | DATA |
-------------
| 1  | 654  |
-------------
| 2  | 321  |
-------------

I'm trying to retrieve a result set which has joined the data from table B and the data from table C, only if one of booth IDs is not null.

SELECT "Table_A"."ID", "Table_A"."ID_B", "Table_A"."ID_C", "Table_A"."ID_D", "Table_B"."DATA", "Table_C"."DATA"
    FROM "Table_A"
        LEFT JOIN "Table_B" on "Table_A"."ID_B" = "Table_B"."ID"
        LEFT JOIN "Table_C" on "Table_A"."ID_C" = "Table_C"."ID"
    WHERE "Table_A"."ID_B" IS NOT NULL OR "Table_A"."ID_C" IS NOT NULL;

Is this recommended or should I better split this in multiple queries?

Is there a way to do an inner join between these tables?

The result I expect is:

-------------------------------------------------
| ID | ID_B | ID_C | ID_D | DATA (B) | DATA (C) |
-------------------------------------------------
| 1  |   1  | NULL | NULL |   123    |   NULL   |
-------------------------------------------------
| 2  | NULL |  1   | NULL |   NULL   |   789    |
-------------------------------------------------
| 3  |   2  |  2   | NULL |   456    |   102    |
-------------------------------------------------

EDIT: ID_B, ID_C, ID_D are foreign keys to the tables table_b, table_c, table_d

You did not show the FK constraints. Are table_a.id_b and table_a.id_c FKs to the B and C tables? (it appears they are, but this is not mentioned in the question) — wildplasser, May 25 '13 at 11:17
In that case my answer is equivalent to the query in your question. — wildplasser, May 25 '13 at 12:16

wildplasser · Accepted Answer · 2013-05-25T11:30:02.693

5

The WHERE "Table_A"."ID_B" IS NOT NULL OR "Table_A"."ID_C" IS NOT NULL; can be replaced by the corresponding clause on the B and C tables : WHERE "Table_B"."ID" IS NOT NULL OR "Table_C"."ID" IS NOT NULL; . This would also work if table_a.id_b and table_a.id_c are not FKs to the B and C tables. Otherwise, a table_a row with { 5, 5,5,5} would retrieve two NULL rows from the B and C tables.

SELECT ta."ID" AS a_id
        , ta."ID_B" AS b_id
        , ta."ID_C" AS c_id
        , ta."ID_D" AS d_id
        , tb."DATA" AS bdata
        , tc."DATA" AS cdata
FROM "Table_a" ta
LEFT JOIN "Table_B" tb on ta."ID_B" = tb."ID"
LEFT JOIN "Table_C" tc on ta."ID_C" = tc."ID"
WHERE tb."ID" IS NOT NULL OR tc."ID" IS NOT NULL
        ;

edited May 25 '13 at 11:30

answered May 25 '13 at 11:01

wildplasser

43,142
8
66
109

Better :-). Apart from your query being a better solution in the event of missing foreign key constraints, do you have any performance benchmarks for the two variants? I tend to think that with indexes put on the foreign key, the original query would be faster (and more readable), unless the optimiser can make the necessary transformation... – Lukas Eder May 25 '13 at 11:33
Yes, it differs from the OQ's query, except in the case where ra.id_b and ta.id_c have FK constraints to the B and C tables. See my comment on the OQ. Otherwise: semantics will be different, plans will be different, performance will be different. (in the simple / small cases, two hash-joins will be used, and the difference will be very small) – wildplasser May 25 '13 at 11:38

score 3 · Answer 2 · answered May 25 '13 at 18:38

Since you have foreign key constraints in place, referential integrity is guaranteed and the query in your Q is already the best answer.

Also indexes on Table_B.ID and Table_C.ID are given.

If matching cases in Table_A are rare (less than ~ 5 %, depending on row with and data distribution) a partial multi-column index would help performance:

CREATE INDEX table_a_special_idx ON "Table_A" ("ID_B", "ID_C")
WHERE "ID_B" IS NOT NULL OR "ID_C" IS NOT NULL;

In PostgreSQL 9.2 a covering index (index-only scan in Postgres parlance) might help even more - in which case you would include all columns of interest in the index (not in my example). Depends on several factors like row width and frequency of updates in your table.

score 2 · Answer 3 · edited May 23 '17 at 12:08

Given your requirements, your query seems good to me.

An alternative would be to use nested selects in the projection, but depending on your data, indexes and constraints, that might be slower, as nested selects usually result in nested loops, whereas joins can be performed as merge joins or nested loops:

SELECT 
    "Table_A"."ID", 
    "Table_A"."ID_B", 
    "Table_A"."ID_C", 
    "Table_A"."ID_D", 
    (SELECT "DATA" FROM "Table_B" WHERE "Table_A"."ID_B" = "Table_B"."ID"),
    (SELECT "DATA" FROM "Table_C" WHERE "Table_A"."ID_C" = "Table_C"."ID")
FROM "Table_A"
WHERE "Table_A"."ID_B" IS NOT NULL OR "Table_A"."ID_C" IS NOT NULL;

If Postgres does scalar subquery caching (as Oracle does), then nested selects might help in case you have a lot of data repetition in Table_A

score 0 · Answer 4 · answered May 25 '13 at 11:16

Generally spealking the recommended way is to do it in one query only, and let the database do as much work as possible, especially if you add other operations like sorting (order by) or pagination later (limit ... offset ...) later. We have done some measurements, and there is no way to sort/paginate faster in Java/Scala, if you use any of the higher level collections like lists etc.

RDBMS deal very well with single complex statements, but they have difficulties in handling many small queries. For example, if you query the "one" and the "many relation" in one query, it will be faster than doing this in 1 + n select statements.

As for the outer join, we have done measurements, and there is no real performance penalty compared with inner joins. So if your data model and/or your query require an outer join, just do it. If it was a performance problem, you can tune it later.

As for your null comparisons, it might indicate that your data model could be optimized, but that is just a guess. Chances are that you can improve the design so that null is not allowed in these columns.

Conderning your last paragraph: it's common and perfectly normal to have null values in such columns. I don't see how that would indicate a problem. — Erwin Brandstetter, May 25 '13 at 18:17
@Erwin That's what I intended to describe. As English is not my native language, what can I improve to make that clear? — Beryllium, May 25 '13 at 18:47
Your English seems just fine. I would delete the last paragraph. — Erwin Brandstetter, May 25 '13 at 19:36

Joining tables if the reference exists

4 Answers4