
I have a very slow query that scans through millions of records. It counts how many numbers fall within each of a set of ranges.

I have two tables, numbers_in_ranges and person:

create table numbers_in_ranges
(
    range_id    number(9,0),
    begin_range number(9,0),
    end_range   number(9,0)
);

create table person
(
    id         integer,
    a_number   varchar2(9),
    first_name varchar2(25),
    last_name  varchar2(25)
);

Data for numbers_in_ranges

range_id | begin_range | end_range
---------|-------------|----------
101      | 100000000   | 200000000
102      | 210000000   | 290000000
103      | 350000000   | 459999999
104      | 461000000   | 569999999
106      | 241000000   | 241999999
etc.

Data for person


id | a_number   | first_name | last_name
---|------------|------------|-----------
1  | 100000001  | Maria      | Doe
2  | 100000999  | Emily      | Davis
3  | 150000000  | Dave       | Smith
4  | 461000000  | Jane       | Jones
6  | 241000001  | John       | Doe
7  | 100000002  | Maria      | Doe
8  | 100009999  | Emily      | Davis
9  | 150000010  | Dave       | Smith
10 | 210000001  | Jane       | Jones
11 | 210000010  | John       | Doe
12 | 281000000  | Jane       | Jones
13 | 241000000  | John       | Doe
14 | 460000001  | Maria      | Doe
15 | 500000999  | Emily      | Davis
16 | 550000010  | Dave       | Smith
17 | 461000010  | Jane       | Jones
18 | 241000020  | John       | Doe
etc.

We are getting the range data from a remote database via a database link and storing it in a materialized view.
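For reference, the materialized view is built roughly like this (a sketch only: the source table name, database link name, and refresh options here are illustrative, not the real definition):

-- source table and db link names are illustrative
create materialized view numbers_in_ranges
    build immediate
    refresh complete on demand
as
select range_id, begin_range, end_range
from   ranges_source@remote_db;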

The query

select nums.range_id, count(p.a_number) as a_count
from numbers_in_ranges nums
left join person p
  on to_number(p.a_number) between nums.begin_range and nums.end_range
group by nums.range_id;

The result looks like

range_id | a_count
---------|--------
101      | 6
102      | 5
103      | 2
104      | 3
etc.

As I said, this query is very slow.

Here is the explain plan

Plan hash value: 3785994407
 
---------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                         | Name               | Rows  | Bytes |TempSpc| Cost (%CPU)| Time     |    TQ  |IN-OUT| PQ Distrib |
---------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                  |                    |  9352 |   264K|       | 42601  (31)| 00:00:02 |        |      |            |
|   1 |  PX COORDINATOR                   |                    |       |       |       |            |          |        |      |            |
|   2 |   PX SEND QC (RANDOM)             | :TQ10002           |  9352 |   264K|       | 42601  (31)| 00:00:02 |  Q1,02 | P->S | QC (RAND)  |
|   3 |    HASH GROUP BY                  |                    |  9352 |   264K|       | 42601  (31)| 00:00:02 |  Q1,02 | PCWP |            |
|   4 |     PX RECEIVE                    |                    |  9352 |   264K|       | 42601  (31)| 00:00:02 |  Q1,02 | PCWP |            |
|   5 |      PX SEND HASH                 | :TQ10001           |  9352 |   264K|       | 42601  (31)| 00:00:02 |  Q1,01 | P->P | HASH       |
|   6 |       HASH GROUP BY               |                    |  9352 |   264K|       | 42601  (31)| 00:00:02 |  Q1,01 | PCWP |            |
|   7 |        MERGE JOIN OUTER           |                    |  2084M|    56G|       | 37793  (23)| 00:00:02 |  Q1,01 | PCWP |            |
|   8 |         SORT JOIN                 |                    |  9352 |   173K|       |     3  (34)| 00:00:01 |  Q1,01 | PCWP |            |
|   9 |          PX BLOCK ITERATOR        |                    |  9352 |   173K|       |     2   (0)| 00:00:01 |  Q1,01 | PCWC |            |
|  10 |           MAT_VIEW ACCESS FULL    | NUMBERS_IN_RANGES  |  9352 |   173K|       |     2   (0)| 00:00:01 |  Q1,01 | PCWP |            |
|* 11 |         FILTER                    |                    |       |       |       |            |          |  Q1,01 | PCWP |            |
|* 12 |          SORT JOIN                |                    |    89M|   850M|  2732M| 29681   (1)| 00:00:02 |  Q1,01 | PCWP |            |
|  13 |           BUFFER SORT             |                    |       |       |       |            |          |  Q1,01 | PCWC |            |
|  14 |            PX RECEIVE             |                    |    89M|   850M|       |  4944   (1)| 00:00:01 |  Q1,01 | PCWP |            |
|  15 |             PX SEND BROADCAST     | :TQ10000           |    89M|   850M|       |  4944   (1)| 00:00:01 |  Q1,00 | P->P | BROADCAST  |
|  16 |              PX BLOCK ITERATOR    |                    |    89M|   850M|       |  4944   (1)| 00:00:01 |  Q1,00 | PCWC |            |
|  17 |               INDEX FAST FULL SCAN| PERSON_AN_IDX      |    89M|   850M|       |  4944   (1)| 00:00:01 |  Q1,00 | PCWP |            |
---------------------------------------------------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
  11 - filter("NUMS"."END_RANGE">=TO_NUMBER("P"."A_NUMBER"(+)))
  12 - access("NUMS"."BEGIN_RANGE"<=TO_NUMBER("P"."A_NUMBER"(+)))
       filter("NUMS"."BEGIN_RANGE"<=TO_NUMBER("P"."A_NUMBER"(+)))
 
Note
-----
   - automatic DOP: Computed Degree of Parallelism is 16 because of degree limit
   

I tried running the monthly deltas and merging them into the table: if a new range_id is found, insert it; if an existing range_id is found, update it. That way we wouldn't have to scan the whole table. (Roughly the merge sketched below.)
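A minimal sketch of that upsert, assuming a hypothetical local table ranges_local with the same columns and a hypothetical monthly_range_deltas source:

-- ranges_local and monthly_range_deltas are hypothetical names
merge into ranges_local tgt
using monthly_range_deltas src
  on (tgt.range_id = src.range_id)
when matched then update set
    tgt.begin_range = src.begin_range,
    tgt.end_range   = src.end_range
when not matched then insert
    (range_id, begin_range, end_range)
    values (src.range_id, src.begin_range, src.end_range);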

But this solution didn't work, because some ranges get updated and splicing happens. For example:

We create a new range_id = 110 covering 100110000 to 210000001; then range_id = 101 is spliced to 100000000–100110000, and range_id = 102 is spliced to 100110001–210000000 (illustrated below).
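Using the sample rows above, the splice looks like this:

Before:

range_id | begin_range | end_range
---------|-------------|----------
101      | 100000000   | 200000000
102      | 210000000   | 290000000

After:

range_id | begin_range | end_range
---------|-------------|----------
101      | 100000000   | 100110000
110      | 100110000   | 210000001
102      | 100110001   | 210000000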

Then I thought of creating a trigger to update that table whenever a range is created or updated; however, that is impossible, since this data comes from a remote database into a read-only materialized view, and you cannot put a trigger on a read-only materialized view.

My question: is there any other way to do this, or to optimize this query?

Thank you!

Mariana
  • First, `left join person p on to_number(p. a_number)` is invalid syntax. I assume there's supposed to be a `=` in there somewhere? And is your `a_number` really stored as a varchar2? Otherwise, there's no need for the `to_number` – eaolson Sep 09 '22 at 01:18
  • did you run an explain plan? – OldProgrammer Sep 09 '22 at 02:02
  • @OldProgrammer I have added the query plan. Please take a look. – Mariana Sep 09 '22 at 18:25
  • @eaolson The syntax is right. Yes, a_number is stored as varchar2. – Mariana Sep 09 '22 at 18:27
  • Do you have a function-based index on `to_number(p.a_number)`? It's using `PERSON_AN_IDX`, but it's not clear what that index is. – eaolson Sep 13 '22 at 19:12

1 Answer


The issue is that Oracle broadcasts the table with all the IDs, which looks quite strange for this case.

However, since you only need to count rows and (it looks like) the intervals do not overlap, you can improve performance and avoid joining the two datasets with a trick: transform the data into an event stream, where each start and end value marks the beginning and end of a series, and then count the number of events inside each series. This lets you use match_recognize, which is dramatically faster than the join.
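For intuition, take range 101 (100000000–200000000) from the question and two person numbers inside it; sorted as an event stream, the rows become:

val       | val_type
----------|---------
100000000 | 01_START
100000001 | 02_val
150000000 | 02_val
200000000 | 03_END

Each match of the pattern start_ values_* end_ then counts the 02_val rows between one START and the following END.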

The query will be:

with ranges_unpivot as (
  /*Transform from_ ... to_... to the event-like structure*/
  select
    id
    , val
    , val_type
  from ranges_table
  unpivot(
    val for val_type in (from_num as '01_START', to_num as '03_END')
  )

  union all

  /*Append the rest of the data to the event stream*/
  select
    null,
    id,
    /*
      This should be ordered between START mark and END mark
      to process edge cases correctly
    */
    '02_val'
  from other_table
  where id <= (select max(to_num) from ranges_table)
)

select /*+parallel(4) gather_plan_statistics*/ *
from ranges_unpivot
match_recognize (
  order by val asc, val_type asc
  measures
    start_.id as range_id,
    count(values_.val) as count_
  pattern (start_ values_* end_)
  define
    start_ as val_type = '01_START',
    values_ as val_type = '02_val',
    end_ as val_type = '03_END'
)

which shows this timing in the query plan:

| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:00.33

Compared to the join query:

select /*+gather_plan_statistics*/
  rt.id as range_id,
  count(ot.id) as count_
from ranges_table rt
  left join other_table ot
  on rt.from_num <= ot.id
  and rt.to_num >= ot.id
group by rt.id

which shows:

| 0 | SELECT STATEMENT | | 1 | | 1 |00:00:13.84 |

See db<>fiddle.
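Adapted to the tables in the question, the same approach would look roughly like this (an untested sketch: it assumes every a_number converts cleanly with to_number and that the ranges do not overlap):

with events as (
  /* each range contributes a START event and an END event */
  select range_id as id, val, val_type
  from numbers_in_ranges
  unpivot (
    val for val_type in (begin_range as '01_START', end_range as '03_END')
  )

  union all

  /* each person number becomes a value event, sorted between START and END */
  select null, to_number(a_number), '02_val'
  from person
  where to_number(a_number) <= (select max(end_range) from numbers_in_ranges)
)
select *
from events
match_recognize (
  order by val asc, val_type asc
  measures
    start_.id          as range_id,
    count(values_.val) as a_count
  pattern (start_ values_* end_)
  define
    start_  as val_type = '01_START',
    values_ as val_type = '02_val',
    end_    as val_type = '03_END'
);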

astentx