
Here is pseudocode to describe my current script:

SELECT
  A.ID
, A.YEAR
, CASE WHEN (SELECT B.ID
               FROM TABLE_B B
              WHERE B.ID = A.ID
                AND B.YEAR = A.YEAR
                AND B.CONDITION = 'TRUE') = A.ID
       THEN CASE WHEN (SELECT C.STATUS
                         FROM TABLE_C C
                        WHERE C.ID = A.ID
                          AND C.YEAR = A.YEAR) = 'X' THEN 'STATUS X'
                 ELSE 'STATUS Y' END
       ELSE CASE WHEN (SELECT C.STATUS
                         FROM TABLE_C C
                        WHERE C.ID = A.ID
                          AND C.YEAR = A.YEAR) = 'Z' THEN 'STATUS Z'
                 ELSE 'STATUS NOT FOUND' END
  END AS STATUS
FROM TABLE_A A

I have simplified this pseudocode; my actual script has more subqueries, all hitting the same two tables. This seems overly convoluted, and I am wondering whether it would be better just to join TABLE_B and TABLE_C in my outer query. Or perhaps to select these fields into a temp table and then write a cursor that will update the STATUS field?

The web application that uses this script will be used by many people, so performance is definitely a concern.

Jake
  • Joins are generally faster than selects in selects. The reason is that each select in select has to be executed for each record processed. So if TABLE_A has 10,000 records and you have 3 subqueries, it has to run 30,000 selects to get your result, versus a join running once for the 3 cases and then (hopefully) using an index to pair back to TABLE_A. – xQbert May 02 '17 at 14:26
  • You say that joins are generally faster than selects in selects - are there any general situations where this is not the case? – Jake May 02 '17 at 16:14
  • If both subsets are small then a select in a select can be faster, because the engine may determine that a full tablescan on the join is faster than the index. In these cases the select in a select can be faster. If the scalar query returns the SAME result for all records, then it can be faster, as the query is run once and pulled from memory for all the subsequent records; but in cases where the scalar query is dependent on other columns in the select and must "look up" each value each time, generally speaking the join will be faster. – xQbert May 02 '17 at 17:34
  • Ask Tom has a couple of great articles on it: https://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:1594885400346999596 and https://asktom.oracle.com/pls/asktom/f%3Fp%3D100:11:::::P11_QUESTION_ID:22868017492361 another article here: https://scankaya.wordpress.com/2010/06/14/62/ and a stack article which basically says it depends on the situation: http://stackoverflow.com/questions/4921227/what-is-better-subqueries-or-inner-joining-ten-tables which is true; but I find scalar subqueries are often used wrong; but there are times and places! – xQbert May 02 '17 at 17:34
  • So (SELECT MAX(ID) FROM table), or MIN or AVG, since the result will be the same EVERY TIME (until the data in the table changes). But correlated subqueries in my mind would be better off as joins, since what is being returned varies each time and forces the engine to run a select for every correlation instead of once for all correlations. – xQbert May 02 '17 at 17:39
  • It is a bit off topic, but if a major goal of the tuning is to maintain performance against many client calls, perhaps it would be worth considering a fast-refresh materialized view as well, if you have the volume to justify it: if the base joins are expensive, they can be done incrementally while clients get a quick return. – alexgibbs May 02 '17 at 23:59

1 Answer


Without seeing the real data it is hard to say anything for certain. xQbert has some interesting comments; I would also recommend benchmarking your real candidate queries against your real data and comparing the results.
The optimizer is smart, but depending on the nature and volume of data in TABLE_A, TABLE_B, and TABLE_C, you may get very different plans, so the answer will depend on the data and the query. The clearest way to know is to test.

Here's an example:

First, create the test tables and load them. For this example, I'll arbitrarily throw 20K rows into each, with fairly uniform data across the three.

CREATE TABLE TABLE_A(ID NUMBER GENERATED ALWAYS AS IDENTITY  NOT NULL PRIMARY KEY, YEAR NUMBER);
CREATE TABLE TABLE_B(ID NUMBER GENERATED ALWAYS AS IDENTITY NOT NULL PRIMARY KEY, YEAR NUMBER, CONDITION VARCHAR2(20));
CREATE TABLE TABLE_C(ID NUMBER GENERATED ALWAYS AS IDENTITY NOT NULL PRIMARY KEY, YEAR NUMBER, STATUS VARCHAR2(20));


INSERT INTO TABLE_A(YEAR) SELECT LEVEL FROM DUAL CONNECT BY LEVEL <= 20000;
INSERT INTO TABLE_B(YEAR,CONDITION)
  SELECT LEVEL, DECODE(MOD(LEVEL,2),0,'TRUE','FALSE') FROM DUAL CONNECT BY LEVEL <= 20000;
INSERT INTO TABLE_C(YEAR,STATUS)
  SELECT LEVEL, DECODE(MOD(LEVEL,3),0,'X',1,'Y','Z') FROM DUAL CONNECT BY LEVEL <= 20000;

... gather stats
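
Gathering the stats could look something like this (a sketch assuming the tables live in the current schema; the DBMS_STATS defaults are fine for a test like this):

BEGIN
  -- Gather optimizer statistics on each test table in the current schema
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_A');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_B');
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLE_C');
END;
/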

Then, compare your queries. First with scalars:

SELECT
  A.ID
  , A.YEAR
  , CASE WHEN (SELECT B.ID
               FROM TABLE_B B
               WHERE B.ID = A.ID
                     AND B.YEAR = A.YEAR
                     AND B.CONDITION = 'TRUE') = A.ID THEN
  CASE WHEN (SELECT C.STATUS
             FROM TABLE_C C
             WHERE C.ID = A.ID
                   AND C.YEAR = A.YEAR) = 'X' THEN 'STATUS X'
  ELSE 'STATUS Y' END
    ELSE CASE WHEN (SELECT C.STATUS
                    FROM TABLE_C C
                    WHERE C.ID = A.ID
                          AND C.YEAR = A.YEAR) = 'Z' THEN 'STATUS Z'
         ELSE 'STATUS NOT FOUND' END
    END AS STATUS
FROM TABLE_A A
ORDER BY 1 ASC;

-----------------------------------------------------------------------------------------------  
| Id  | Operation                     | Name          | Rows  | Bytes | Cost (%CPU)| Time     |  
-----------------------------------------------------------------------------------------------  
|   0 | SELECT STATEMENT              |               | 20000 |   156K| 89479   (1)| 00:00:04 |  
|*  1 |  TABLE ACCESS BY INDEX ROWID  | TABLE_B       |     1 |    14 |     2   (0)| 00:00:01 |  
|*  2 |   INDEX UNIQUE SCAN           | SYS_C00409109 |     1 |       |     1   (0)| 00:00:01 |  
|*  3 |   TABLE ACCESS BY INDEX ROWID | TABLE_C       |     1 |    10 |     2   (0)| 00:00:01 |  
|*  4 |    INDEX UNIQUE SCAN          | SYS_C00409111 |     1 |       |     1   (0)| 00:00:01 |  
|*  5 |    TABLE ACCESS BY INDEX ROWID| TABLE_C       |     1 |    10 |     2   (0)| 00:00:01 |  
|*  6 |     INDEX UNIQUE SCAN         | SYS_C00409111 |     1 |       |     1   (0)| 00:00:01 |  
|   7 |  SORT ORDER BY                |               | 20000 |   156K| 89479   (1)| 00:00:04 |  
|   8 |   TABLE ACCESS FULL           | TABLE_A       | 20000 |   156K|    11   (0)| 00:00:01 |  
-----------------------------------------------------------------------------------------------  


Statistics
-----------------------------------------------------------
               4  CPU used by this session
               4  CPU used when call started
              11  DB time
           10592  Requests to/from client
           10591  SQL*Net roundtrips to/from client
           80006  buffer is not pinned count
             766  buffer is pinned count
          117828  bytes received via SQL*Net from client
         2531165  bytes sent via SQL*Net to client
               2  calls to get snapshot scn: kcmgss
               5  calls to kcmgcs
           41127  consistent gets
...
...
etc.

And compare with a join query that returns equivalent data:

SELECT
  TABLE_A.ID,
  TABLE_A.YEAR,
  CASE WHEN TABLE_B.CONDITION = 'TRUE'
    THEN
      DECODE(TABLE_C.STATUS, 'X', 'STATUS X', 'STATUS Y')
  ELSE DECODE(TABLE_C.STATUS, 'Z', 'STATUS Z', 'STATUS NOT FOUND') END AS STATUS
FROM TABLE_A
  LEFT OUTER JOIN TABLE_B
    ON TABLE_A.ID = TABLE_B.ID
       AND TABLE_A.YEAR = TABLE_B.YEAR
  LEFT OUTER JOIN TABLE_C
    ON TABLE_A.ID = TABLE_C.ID
       AND TABLE_A.YEAR = TABLE_C.YEAR
ORDER BY 1 ASC;

----------------------------------------------------------------------------------  
| Id  | Operation              | Name    | Rows  | Bytes | Cost (%CPU)| Time     |  
----------------------------------------------------------------------------------  
|   0 | SELECT STATEMENT       |         | 20000 |   625K|    42  (10)| 00:00:01 |  
|   1 |  SORT ORDER BY         |         | 20000 |   625K|    42  (10)| 00:00:01 |  
|*  2 |   HASH JOIN RIGHT OUTER|         | 20000 |   625K|    40   (5)| 00:00:01 |  
|   3 |    TABLE ACCESS FULL   | TABLE_C | 20000 |   195K|    13   (0)| 00:00:01 |  
|*  4 |    HASH JOIN OUTER     |         | 20000 |   429K|    26   (4)| 00:00:01 |  
|   5 |     TABLE ACCESS FULL  | TABLE_A | 20000 |   156K|    11   (0)| 00:00:01 |  
|   6 |     TABLE ACCESS FULL  | TABLE_B | 20000 |   273K|    14   (0)| 00:00:01 |  
---------------------------------------------------------------------------------- 

Statistics
-----------------------------------------------------------
           1  CPU used by this session
           1  CPU used when call started
          13  DB time
       10592  Requests to/from client
       10591  SQL*Net roundtrips to/from client
      117622  bytes received via SQL*Net from client
     2531166  bytes sent via SQL*Net to client
           2  calls to get snapshot scn: kcmgss
          11  calls to kcmgcs
         160  consistent gets
...
...
etc.

In this example all the tables had similar row counts and 1:1 join relationships, and the two queries got different plans, as expected.
But what if TABLE_A had 1B rows and very few of them joined to TABLE_C? Or if TABLE_B were 100% 'TRUE', or the data were highly skewed?
Both the scalar plan and the join plan will change against different data sets.
Throwing a temp table into the mix would surely perform differently as well (I'd be surprised if it won the day in this kind of scenario, but it can be tested too).
Testing to find the best fit for your data is the surest way, especially given your comment that performance is a concern.
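
For completeness, if you did want to try the temp-table route, a single set-based MERGE would generally beat a row-by-row cursor. A rough sketch (the GTT name TMP_STATUS and the MERGE in place of a cursor are my own choices, not from the question):

-- Illustrative only: a global temporary table holding one row per TABLE_A row
CREATE GLOBAL TEMPORARY TABLE TMP_STATUS (
  ID NUMBER, YEAR NUMBER, STATUS VARCHAR2(20)
) ON COMMIT PRESERVE ROWS;

INSERT INTO TMP_STATUS(ID, YEAR) SELECT ID, YEAR FROM TABLE_A;

-- One set-based statement instead of a cursor loop
MERGE INTO TMP_STATUS T
USING (SELECT A.ID, A.YEAR,
              CASE WHEN B.CONDITION = 'TRUE'
                THEN DECODE(C.STATUS, 'X', 'STATUS X', 'STATUS Y')
                ELSE DECODE(C.STATUS, 'Z', 'STATUS Z', 'STATUS NOT FOUND') END AS STATUS
         FROM TABLE_A A
           LEFT OUTER JOIN TABLE_B B ON A.ID = B.ID AND A.YEAR = B.YEAR
           LEFT OUTER JOIN TABLE_C C ON A.ID = C.ID AND A.YEAR = C.YEAR) S
ON (T.ID = S.ID AND T.YEAR = S.YEAR)
WHEN MATCHED THEN UPDATE SET T.STATUS = S.STATUS;

Note that this still runs the same join, plus an extra insert and update, so it would have to be measured like the rest.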

alexgibbs