Why can't I seem to force Oracle 11g to consume more CPUs for a single SQL query

Question

I have some humongous queries that run on gigantic tables. These queries seem to be CPU bottlenecked, and run for hours. I understand that Oracle has a lot of new features with 11g, release 2 to internally paralellize the execution of a query. However no matter what kinds of hints I put in the query, I cant seem to use more than 1 CPU on the database box. I have a very respectable Solaris machine with 8 CPUs, however everytime I run this query, I end up just pushing one CPU to 100% and then sit there for hours.

The hints I have tried are:

SELECT /*+ PARALLEL */ ...
SELECT /*+ PARALLEL(5) */ ...
SELECT /*+ PARALLEL(10) */ ...

None of that appeared to work when looking at overall CPU consumption on the box. It always seemed to peg one CPU at 100%. Unfortunately even explain plan seems to take forever to run. I will try and get different explain plans with the different hints and see if that helps. Is it possible that some queries are simply un-paralleable, even if their runtime is in the hours?!!? The main table in this query has 335 million rows.

SQL Query Text:

http://pastie.org/8634380

System Parameters:

http://pastie.org/8634383

Edit:

Detailed Explain Plan - No Parallelism:

http://pastebin.com/HkZgbPpf

Optimizer related system parameters:

http://pastie.org/8639841

Further EDIT: We have reached out to Oracle to understand why EXPLAIN PLAN takes more than 2 hours. We are timing out trying to run the various explain plans.

Did you read the documentation -http://docs.oracle.com/cd/B19306_01/server.102/b14223/usingpe.htm — OldProgrammer, Jan 15 '14 at 02:46
It's very unusual for an explain plan to take more than a few seconds. Other than a bug, the only reason I can think of is if dynamic_sampling is set to 10, and will "sample" the entire table. Which would also explain a parallel query only using one CPU - that recursive SQL is not run in parallel. Unfortunately that information is not in the explain plan you posted. I'm not sure what tool you used, but if you use `explain plan for ...` and `select * from table(dbms_xplan.display)` it will include that setting in the `Note` section. Also, the SQL statement in the plan has no parallel hint. — Jon Heller, Jan 16 '14 at 05:40
I added a more detailed explain plan. The Note section simply says "star schema used". The explain plan takes well over 3 hours, which is about the time it takes for the query to run as well. I am unable to get the explain plan with paralell hint to complete! :( — NG Algo, Jan 16 '14 at 18:17
dynamic sampling is set to 2. I am adding all oiptimizer parameters to the details as well. Should I disable dynamic_sampling altogether? — NG Algo, Jan 16 '14 at 18:25
looking at the explain_plan, almost all the time is being spent in "hash joins", which is essentially consuming CPU resource only. Not sure whether paralellism helps this particular step much or not. — NG Algo, Jan 16 '14 at 18:29
@NGAlgo See "Another Note" on my answer concerning the hash joins. They're fine, it's that hash group by. — David Aldridge, Jan 16 '14 at 20:03
Keep in mind that explain plans only gives a forecast, and "the forecast is always wrong". This is where SQL monitoring is helpful - it will tell you how long each step is really taking. But perhaps the first problem to tackle is the excessive parse time. It sounds like a bug that will require tracing or Oracle support to track down. There are some weird cases where the parse tree grows exponentially - see my answer [here](http://stackoverflow.com/a/19805883/409172) for an example. But be careful generating the trace - it could generate a massive file in an unexpected place. — Jon Heller, Jan 17 '14 at 05:41

score 14 · Answer 1 · edited Jun 15 '14 at 15:25

The most important thing to understand about Oracle parallelism is that it's complicated. Optimizing parallelism requires a lot of Oracle knowledge, reading the manuals, checking many parameters, testing long-running queries, and a lot of skepticism.

Ask the Right Questions

Parallel problems really involve three different questions:

How many parallel servers were requested?
How many parallel servers were allocated?
How many parallel servers were meaningfully used?

Use the Best Tools

Go straight to the best tool - SQL Monitoring with active reports. Find your SQL_ID and generate the HTML report: select dbms_sqltune.report_sql_monitor(sql_id => 'your_sql_id', type => 'active') from dual;. This is the only way to know how much time was spent on each step in the execution plan. And it will tell you how much parallelism was effectively used, and where. For example: enter image description here

Another good options is type => 'text'. It doesn't have quite as much information but it's a quicker to look at and easier to share.

SQL Monitoring also includes the DOP requested and the DOP allocated: enter image description here

A 100-line parallel select may run beautifully, but then everything halts at a single step because of an uncached sequence. You can stare at an explain plan, a trace, or an AWR report for hours and not see the problem. The active report makes the slow steps almost trivial to find. Do not waste time guessing where the problem lies.

However, other tools are still required. An explain plan generated with explain plan for ... and select * from table(dbms_xplan.display); will provide a few key pieces of information. Specifically the Notes section can include many reasons why the query did not request parallelism.

But WHY did I get that number of parallel servers?

The relevant information is spread over several different manuals, which are very useful but occasionally inaccurate or misleading. There are many myths and much bad advice about parallelism. And the technology changes significantly with each release.

When you put together all of the reputable sources, the list of factors influencing the number of parallel servers is astonishingly large. The list below is ordered roughly by what I think are the most important factors:

Inter-operation parallelism Any query using sorting or grouping will allocate twice as many parallel servers as the DOP. This is probably responsible for the myth "Oracle allocates as many parallel servers as possible!".
Query hint Preferably a statement-level hint like /*+ parallel */, or possibly an object-level hint like /*+ noparallel(table1) */. If a specific step of a plan is running in serial it is usually because of object-level hints on only part of the query.
Recursive SQL Some operations may run in parallel but can be effectively serialized by recursive SQL. For example, an uncached sequence on a large insert. Recursive SQL generated to parse the statement will also be serial; for example dynamic sampling queries.
Alter session alter session [force|enable] parallel [query|dml|ddl]; Note that parallel DML is disabled by default.
Table degree
Index degree
Index was cheaper Parallel hints only tell the optimizer to consider a full table scan with a certain DOP. They do not actually force parallelism. The optimizer is still free to use a serial index-access if it think it's cheaper. (The FULL hint may help solve this issue.)
Plan management SQL Plan Baselines, outlines, profiles, advanced rewrite, and SQL Translators can all change the degree of parallelism behind your back. Check the Note section of the plan.
Edition Only Enterprise and Personal Editions allow parallel operations. Except for the package DBMS_PARALLEL_EXECUTE.
PARALLEL_ADAPTIVE_MULTI_USER
PARALLEL_AUTOMATIC_TUNING
PARALLEL_DEGREE_LIMIT
PARALLEL_DEGREE_POLICY
PARALLEL_FORCE_LOCAL
PARALLEL_INSTANCE_GROUP
PARALLEL_IO_CAP_ENABLED
PARALLEL_MAX_SERVERS This is the upper limit for the whole system. There's a trade-off here. Running too many parallel servers at once is bad for the system. But downgrading a query to serial can be disastrous for some queries.
PARALLEL_MIN_PERCENT
PARALLEL_MIN_SERVERS
PARALLEL_MIN_TIME_THRESHOLD
PARALLEL_SERVERS_TARGET
PARALLEL_THREADS_PER_CPU
Number of RAC nodes Another multiplier for default DOP.
CPU_COUNT If the default DOP is used.
RECOVERY_PARALLELISM
FAST_START_PARALLEL_ROLLBACK
Profile SESSIONS_PER_USER also limits parallel servers.
Resource Manager
System load If parallel_adaptive_multi_user is true. Probably impossible to guess when Oracle will start throttling.
PROCESSES
Parallel DML restrictions Parallel DML will not work if any of these cases:
1. COMPATIBLE < 9.2 for intra-partition
2. INSERT VALUES, tables with triggers
3. replication
4. self-referential integrity or delete cascade or deferred integrity constraints
5. accessing an object column
6. non-partitioned table with LOB
7. intra-partition parallelism with a LOB
8. distributed transaction
9. clustered tables
10. temporary tables
Scalar subqueries do not run in parallel? This is in the manual, and I wish this was true, but my tests indicate that parallelism works here in 11g.
ENQUEUE_RESOURCES Hidden parameter in 10g, is this relevant any more?
Index-organized tables Cannot direct-path insert to IOTs in parallel? (Is this still true?)
Parallel pipelined function requirements Must use a CURSOR(?). TODO.
Functions must be PARALLEL_ENABLE
Type of statement Older versions restricted parallelism on DML depending on partitioning. Some of the current manuals still include this but it is certainly not true anymore.
Number of partitions Only for partition-wise joins on older versions.(?)
Bugs Specifically I've seen a lot of bugs with parsing. Oracle will allocate the right number of parallel servers but nothing will happen as they all wait for events like cursor: pin s wait on x.

This list is certainly not complete, and does not include 12c features. And it doesn't address operating system and hardware issues. And it doesn't answer the horribly difficult question, "what is the best degree of parallelism?" (Short answer: more is usually better, but at the expense of other processes.) Hopefully it at least gives you a sense of how difficult these problems can be, and a good place to start looking.

Thank you Jones, this is an excellent reference point. Let me try and get as many of these data points together and try to get a definitive answer for this particular query. — NG Algo, Jan 15 '14 at 22:35
It might be worth mentioning that `dbms_sqltune.report_sql_monitor` is only available with the performance and diagnostic pack in the Enterprise Edition if I'm not mistaken (but then, to run a parallel query you need EE anyway) — , Apr 11 '15 at 10:22

David Aldridge · Answer 2 · 2014-01-16T20:00:43.287

So Oracle is using a star_transformation here, which is essentially a means of combining multiple bitmap indexes to give a set of rowid's to access on the main table. Using the rowid's to access the table is not a very parallel-amenable operation, in the way that a full table scan would be (or preferably a scan of a subset of partitions).

The star_transformation is being used because Oracle estimates that a relatively small number of rows will match all of those conditions -- it would be interesting to hear if the estimate of 15 million rows is correct or not. 15/335 is about 4.4% of the table rows, so on the face of it the decision to use an index-based method is appropriate unless the rows end up being completely randomly scattered among the blocks and you're accessing 30% of the blocks.

Anyway, I have a feeling that if Oracle is going to choose a star_transformation then the parallel hint becomes irrelevant.

The alternative to the star_transformation would be to use a full table scan on the large table, so I think it would be useful firstly to hint for a full table scan and parallelism. You might also issue an alter session command to disable star transformations temporarily as well, but the important part of hinting is to say exactly what access method you want before you dip into the parallelism.

Further Note: Incidentally, the temporary tables that you see in the plan are due to the star transformation -- when the dimension tables are first scanned to find the applicable rows, Oracle eliminates the need to find the useful subset again during the join-back phase of the query by storing them in a temporary table. You can disable this behaviour, but it's probably doing good here.

Another Note: Had another look at the explain plan -- those times of 00:27:44, 00:27:45 etc are cumulative. They include for example the 00:27:43 for TABLE ACCESS BY INDEX ROWID of /B28/EUDIQSBV. If you take that into consideration, you see that the hash joins take a few seconds each, and the performance hogs is the HASH GROUP BY at around 3 hours. It's using 4GB of temp space according to the estimate. That is probably a serious drain -- monitor the execution using V$SQL_WORKAREA to see how many passes it's doing, but PQ is an answer to that as well as long as you have the storage bandwidth.

The vast majority of time is being spent in Hash Joins. I would assume hash joins can be paralellized. Is that not an accurate assumption? — NG Algo, Jan 16 '14 at 19:31
Hash joins themselves are a practically zero-cost operation unless they have to spill to disk when you consider them on a per-row basis. It's the scan of the tables that generally takes the time -- building the hash table and probing it for the join is very cheap. — David Aldridge, Jan 16 '14 at 19:37
If you look at IDs 9, 17, 25 etc.. of the explain plan, the HASH joins in this particular case do not seem to be zero cost operations like you say. Can you please take another look? — NG Algo, Jan 16 '14 at 19:50
Thank you for that clarification. You seem to indicate that the HASH GROUP BY should benefit from parallel hint, is that correct? Our storage has a lot of bandwidth and our metrics show low utilization/load on the storage overall with very high load on CPU. Do you still believe needing to switch off star_transformation is a pre-requisite to gain benefit from parallel hint? I don't quite know what exactly would I be trying to monitor using V$SQL_WORKAREA. Can you share any links on that? — NG Algo, Jan 16 '14 at 23:14
So V$SQL_WORKAREA gives you metrics on sorting and grouping operations. It will show the memory consumption, peak memory consumption, storage consumption, and the number of passes. http://docs.oracle.com/cd/E11882_01/server.112/e40402/dynviews_3061.htm#REFRN30256 If you have plenty of storage bandwidth then I would certainly try the full table scan in parallel to see how it performs -- alternatively you could try a parallel_index hint to see if it parallelises the star transformation ... just have a look at the plan. Maybe try posting the explain plans for both approaches. — David Aldridge, Jan 17 '14 at 10:24

score 0 · Answer 3 · answered Jan 15 '14 at 07:12

PARALLEL operations can bring value in some specific cases, and if you just put PARALLEL hint into query this does not mean Oracle will parallel anything adhoc. Please see example below:

SQL> select * from t
  2  where id = 14
  3  /


Execution plan
----------------------------------------------------------                      
Plan hash value: 1859958591                                                     

--------------------------------------------------------------------------------
----------------------------                                                    

| Id  | Operation                   | Name                         | Rows  | Byt
es | Cost (%CPU)| Time     |                                                    

--------------------------------------------------------------------------------
----------------------------                                                    

|   0 | SELECT STATEMENT            |                              |  5210 |  21
67K|   478   (1)| 00:00:06 |                                                    

|   1 |  TABLE ACCESS BY INDEX ROWID| T          |  5210 |  21
67K|   478   (1)| 00:00:06 |                                                    

|*  2 |   INDEX RANGE SCAN          | T_FK_I |  5210 |    
   |    20   (0)| 00:00:01 |                                                    

--------------------------------------------------------------------------------
----------------------------                                                    


Predicate Information (identified by operation id):                             
---------------------------------------------------                             

   2 - access("ID"=14)                                                      


Statistics
----------------------------------------------------------                      
         17  recursive calls                                                    
          0  db block gets                                                      
      10607  consistent gets                                                    
          0  physical reads                                                     
          0  redo size                                                          
    3734464  bytes sent via SQL*Net to client                                   
     994894  bytes received via SQL*Net from client                             
       7256  SQL*Net roundtrips to/from client                                  
          0  sorts (memory)                                                     
          0  sorts (disk)                                                       
       5295  rows processed                                                     

SQL> select /*+ parallel(4) */ * from t
  2  where id = 14
  3  /


Execution plan
----------------------------------------------------------                      
Plan hash value: 1859958591                                                     

--------------------------------------------------------------------------------
----------------------------                                                    

| Id  | Operation                   | Name                         | Rows  | Byt
es | Cost (%CPU)| Time     |                                                    

--------------------------------------------------------------------------------
----------------------------                                                    

|   0 | SELECT STATEMENT            |                              |  5210 |  21
67K|   478   (1)| 00:00:01 |                                                    

|   1 |  TABLE ACCESS BY INDEX ROWID| T          |  5210 |  21
67K|   478   (1)| 00:00:01 |                                                    

|*  2 |   INDEX RANGE SCAN          | T_FK_I |  5210 |    
   |    20   (0)| 00:00:01 |                                                    

--------------------------------------------------------------------------------
----------------------------                                                    


Predicate Information (identified by operation id):                             
---------------------------------------------------                             

   2 - access("ID"=14)                                                      

Note                                                                            
-----                                                                           
   - Degree of Parallelism is 1 because of hint                                 


Statistics
----------------------------------------------------------                      
         17  recursive calls                                                    
          0  db block gets                                                      
      10607  consistent gets                                                    
          0  physical reads                                                     
          0  redo size                                                          
    3734464  bytes sent via SQL*Net to client                                   
     994894  bytes received via SQL*Net from client                             
       7256  SQL*Net roundtrips to/from client                                  
          0  sorts (memory)                                                     
          0  sorts (disk)                                                       
       5295  rows processed                                                     

SQL> select /*+ full(t) */ * from t
  2  where id = 14
  3  /



Execution plan
----------------------------------------------------------                      
Plan hash value: 565085413                                                      

--------------------------------------------------------------------------------
---------                                                                       

| Id  | Operation         | Name                | Rows  | Bytes | Cost (%CPU)| T
ime     |                                                                       

--------------------------------------------------------------------------------
---------                                                                       

|   0 | SELECT STATEMENT  |                     |  5210 |  2167K|  3858   (1)| 0
0:00:47 |                                                                       

|*  1 |  TABLE ACCESS FULL| T |  5210 |  2167K|  3858   (1)| 0
0:00:47 |                                                                       

--------------------------------------------------------------------------------
---------                                                                       


Predicate Information (identified by operation id):                             
---------------------------------------------------                             

   1 - filter("ID"=14)                                                      


Statistics
----------------------------------------------------------                      
         17  recursive calls                                                    
          1  db block gets                                                      
      19468  consistent gets                                                    
        507  physical reads                                                     
          0  redo size                                                          
    3734334  bytes sent via SQL*Net to client                                   
     994894  bytes received via SQL*Net from client                             
       7256  SQL*Net roundtrips to/from client                                  
          0  sorts (memory)                                                     
          0  sorts (disk)                                                       
       5295  rows processed                                                     

SQL> select /*+ parallel(4) full(t) */ * from t
  2  where id = 14
  3  /



Execution plan
----------------------------------------------------------                      
Plan hash value: 298470658                                                      

--------------------------------------------------------------------------------
-----------------------------------------                                       

| Id  | Operation            | Name                | Rows  | Bytes | Cost (%CPU)
| Time     |    TQ  |IN-OUT| PQ Distrib |                                       

--------------------------------------------------------------------------------
-----------------------------------------                                       

|   0 | SELECT STATEMENT     |                     |  5210 |  2167K|  1070   (1)
| 00:00:03 |        |      |            |                                       

|   1 |  PX COORDINATOR      |                     |       |       |            
|          |        |      |            |                                       

|   2 |   PX SEND QC (RANDOM)| :TQ10000            |  5210 |  2167K|  1070   (1)
| 00:00:03 |  Q1,00 | P->S | QC (RAND)  |                                       

|   3 |    PX BLOCK ITERATOR |                     |  5210 |  2167K|  1070   (1)
| 00:00:03 |  Q1,00 | PCWC |            |                                       

|*  4 |     TABLE ACCESS FULL| T |  5210 |  2167K|  1070   (1)
| 00:00:03 |  Q1,00 | PCWP |            |                                       

--------------------------------------------------------------------------------
-----------------------------------------                                       


Predicate Information (identified by operation id):                             
---------------------------------------------------                             

   4 - filter("ID"=14)                                                      

Note                                                                            
-----                                                                           
   - Degree of Parallelism is 4 because of hint                                 


Statistics
----------------------------------------------------------                      
         12  recursive calls                                                    
         65  db block gets                                                      
      17262  consistent gets                                                    
      14401  physical reads                                                     
          0  redo size                                                          
    3736075  bytes sent via SQL*Net to client                                   
     994894  bytes received via SQL*Net from client                             
       7256  SQL*Net roundtrips to/from client                                  
          0  sorts (memory)                                                     
          0  sorts (disk)                                                       
       5295  rows processed

As you can see when I just added PARALLEL hint to the query I still got index access. And you should analyse your execution plan and statistic first and get the answer on question is your current plan so poor as you believe ? May be this is the best. As you also can see parallel full scan did significantly more logical reads (including disk reads), so if the degree of parallelism is irrelevant you can get the performance degradation instead of any win. And last but not least - did you consider such oprion as the partitioning ? It can be quite helpful if you need to get just a small part of data in accordance to stable and predictable criteria.

Thank you Dmitry for the example. I am yet to see if indeed the plan is so horrible. I wish I could partition the database tables, but in this case I am unable to do so due to licensing and application limitations. — NG Algo, Jan 15 '14 at 22:31
Hi Dmitry, now that I have published the EXPLAIN PLAN, can you comment on it and your thoughts on whether it should benefit from a PARALLEL hint? HASH GROUP BY seems to be taking the vast chunk of the overall time — NG Algo, Jan 16 '14 at 23:17
HASH GROUP BY does seems that you need (I guess because you have 15M rows output) and this seems natural CBO solution due to HASH JOIN. Your plan makes me thinking you are working with some DWH and have star schema (fact table plus dimentions). A lot of OR filtering may indicate you need some kind of partitioning, in this case parallel oparations will be natural way to do the job. — Dmitry Nikiforov, Jan 17 '14 at 06:08

Why can't I seem to force Oracle 11g to consume more CPUs for a single SQL query

3 Answers3

Linked

Related