Order the output of union of two queries using shared columns from both tables efficiently

Question

I have two queries to get data from different tables w/ the same schema. I want to union the results and order by shared columns in both tables. This needs to be done as efficiently as possible (both tables have huge amount of data, as do the results of both queries). Running just one query with ordering takes ~1 hour. There are approximately 500 million rows in the first table, currently few in the other one (but will increase). The number of items returned by the query is approximately 25 million.

Two tables with the same schema but different data:

tablea:

Name   |  ImpFile   |  ImpTime                  | FieldX  | FieldY
Sam      Imp01        2012-05-16 09:54:02.477     blah      abcde
Ann      Imp01        2012-05-16 09:54:02.478     blah      ldkse
Bart     Import12     2012-05-16 09:55:37.387     blah      dkcke
Sasha    Import12     2012-05-16 09:55:37.385     blah      leele

tableb:

Name   |  ImpFile   |  ImpTime                  | FieldX  | FieldY
Mark     Imp01        2012-05-16 09:54:02.477     blah      lslsk
John     Import12     2012-05-16 09:55:37.384     blah      lmwqd

output should be in this order:

Ann      Imp01        2012-05-16 09:54:02.478
Bart     Import12     2012-05-16 09:55:37.387
John     Import12     2012-05-16 09:55:37.384
Mark     Imp01        2012-05-16 09:54:02.477
Sam      Imp01        2012-05-16 09:54:02.477
Sasha    Import12     2012-05-16 09:55:37.385

I'm thinking something like this might work (am unable to test, don't have access to tables currently), but I think it's inefficient to union and then select * from that, just for the ordering:

SELECT * from
(
  SELECT
    a.Name as field1,
    a.ImpFile as field2,
    a.ImpTime
  FROM
    tablea a
  WHERE
    a.fieldX = "blah" AND
    length(a.fieldY) = 5
  UNION ALL
  SELECT
    b.Name as field1,
    b.ImpFile as field2,
    b.ImpTime
  FROM
    tableb b
  WHERE
    b.fieldX = "blah" AND
    length(b.fieldY) = 5
) foo
ORDER BY field1, field2;

I think it's the most simple efficient way. Also if there's no duplicating data expected (or duplicating is correct) replace `union` to `union all`, because in first case it will act like you add `distinct` word for union result (see http://stackoverflow.com/a/49928/1094048). It can speed up execution a lot — pkuderov, Oct 04 '13 at 19:19
There should be no duplicate data, so union all makes sense. — user1145925, Oct 04 '13 at 19:24
I wonder if the `WHERE` conditions that you're omitting are the bottleneck there. — Mosty Mostacho, Oct 04 '13 at 19:27
Using application code such as .net, coldfusion, etc to sort query results is often faster than using an order by clause in your query. If you are in that situation, look at that option. — Dan Bracuk, Oct 04 '13 at 19:32
Will not be able to use other applications. This must be done all in one query. @Mosty The where conditions are the same for both queries. — user1145925, Oct 04 '13 at 19:36
Seems like it would be more efficient to skip the subselect (or superselect, depending on how you look at it), and just add `order by 1, 2` to the end of the main query. — AndyDan, Oct 04 '13 at 19:39
Is that correct notation and function? You mean "(select X from A where Z union all select X from B where Z) order by 1, 2;"? Basically, you're suggesting to keep my example query as is w/in parens, followed by "order by 1, 2;" ? — user1145925, Oct 04 '13 at 19:54
@AndyDan, I think sql server isn't that stupid so outmost `select * from ...` doesn't affect on execution time at all. But I may wrong, of course, so a couple of tests may be usefull — pkuderov, Oct 04 '13 at 19:55
@pkuderov, the tags indicates he's using Oracle. I did a sample test (obviously my tables were different than his), and the explain plan was better without the `select * from (...`. — AndyDan, Oct 04 '13 at 20:06
@AndyDan oh, thanks, didn't mention semantic of `|` signs :) And since I'm not very friendly with Oracle then there's no choice to debate for me... — pkuderov, Oct 04 '13 at 20:14
How many rows are there in the query? Do you consume all the rows? — Shannon Severance, Oct 07 '13 at 15:55

score 1 · Answer 1 · answered Oct 12 '13 at 19:52

The fastest way to run this query is to remove the inline view and use parallelism.

Here are the average number of seconds to run the test cases at the bottom of this answer.

                   No Parallel       Parallel
Inline View        221               206
No Inline View     167               154

Details

Parallel execution can significantly improve performance of many long-running queries. But it requires Enterprise Edition, adequate resources, a sane configuration, etc. On 11gR2, it may be as simple as SELECT /*+ parallel */ * from .... If you don't have Enterprise Edition, then something like @pkuderov's "poor-man's parallelism" may help.

It is surprising that @AndyDan's suggestion to move the order by inside the inline view works so well. Normally Oracle is smart enough to make those simple query transformations for you. Oracle has obviously put a lot of thought into the performance of UNION ALL statements. They are supported in materialized views, there is the USE_CONCAT hint to enable switching between OR and UNION plans, 12c can concurrently execute branches, etc. It's odd that they missed such a simple performance fix.

An important detail might be missing from this problem. If the query takes an hour to run, the results are not being displayed on a screen. If the data is being stored, much of the execution time may be used to write to disk, log the changes, update indexes, etc. You should include more information about how the results are stored.

The question asks for efficiency, not performance. Answered literally, the most efficient solution is "No Inline View, No Parallel", because parallelism can consume a lot of extra resources for a small benefit. Especially in my test, since I ran it on a desktop with a single hard-drive. Using 2 parallel threads will almost never exactly double performance, but on most servers it should do much better than my test case.

Test setup

--Create sample tables with 67 million rows, gather stats, create table to hold results.
create table tablea nologging as
select 'Sam'   name,  'Imp01'    impfile, timestamp '2012-05-16 09:54:02.477' imptime, 'blah' fieldX, 'abcde' fieldY from dual union all
select 'Ann'   name,  'Imp01'    impfile, timestamp '2012-05-16 09:54:02.478' imptime, 'blah' fieldX, 'ldkse' fieldY from dual union all
select 'Bart'  name,  'Import12' impfile, timestamp '2012-05-16 09:55:37.387' imptime, 'blah' fieldX, 'dkcke' fieldY from dual union all
select 'Sasha' name,  'Import12' impfile, timestamp '2012-05-16 09:55:37.385' imptime, 'blah' fieldX, 'leele' fieldY from dual;

begin
    for i in 1 .. 24 loop
        insert /*+ append */ into tablea select * from tablea;
        commit;
    end loop;
end;
/

create table tableb nologging as
select 'Mark' name, 'Imp01'    impfile, timestamp '2012-05-16 09:54:02.477' imptime, 'blah' fieldX, 'lslsk' fieldY from dual union all
select 'John' name, 'Import12' impfile, timestamp '2012-05-16 09:55:37.384' imptime, 'blah' fieldX, 'lmwqd' fieldY from dual;

begin
    for i in 1 .. 25 loop
        insert /*+ append */ into tableb select * from tableb;
        commit;
    end loop;
end;
/

begin
    dbms_stats.gather_table_stats(user, 'TABLEA');
    dbms_stats.gather_table_stats(user, 'TABLEB');
end;
/

create table results nologging as select name, impfile, imptime from tablea;

Test queries

--#1: Inline view, no parallel.
insert /*+ append */ into results
SELECT * from
(
  SELECT a.Name as field1, a.ImpFile as field2, a.ImpTime
  FROM tablea a WHERE a.fieldX = 'blah' AND length(a.fieldY) = 5
  UNION ALL
  SELECT b.Name as field1, b.ImpFile as field2, b.ImpTime
  FROM tableb b WHERE b.fieldX = 'blah' AND length(b.fieldY) = 5
) foo
ORDER BY field1, field2;


--#2: No inline view, no parallel.
insert /*+ append */ into results
SELECT a.Name as field1, a.ImpFile as field2, a.ImpTime
FROM tablea a WHERE a.fieldX = 'blah' AND length(a.fieldY) = 5
UNION ALL
SELECT b.Name as field1, b.ImpFile as field2, b.ImpTime
FROM tableb b WHERE b.fieldX = 'blah' AND length(b.fieldY) = 5
ORDER BY 1, 2;

--#3: Inline view, parallel.
insert /*+ append */ into results
SELECT /*+ parallel(2) */ * from
(
  SELECT a.Name as field1, a.ImpFile as field2, a.ImpTime
  FROM tablea a WHERE a.fieldX = 'blah' AND length(a.fieldY) = 5
  UNION ALL
  SELECT b.Name as field1, b.ImpFile as field2, b.ImpTime
  FROM tableb b WHERE b.fieldX = 'blah' AND length(b.fieldY) = 5
) foo
ORDER BY field1, field2;

--#4: No inline view, parallel.
insert /*+ append */ into results
SELECT /*+ parallel(2) */ a.Name as field1, a.ImpFile as field2, a.ImpTime
FROM tablea a WHERE a.fieldX = 'blah' AND length(a.fieldY) = 5
UNION ALL
SELECT b.Name as field1, b.ImpFile as field2, b.ImpTime
FROM tableb b WHERE b.fieldX = 'blah' AND length(b.fieldY) = 5
ORDER BY 1, 2;

Test Notes

The test was run on 12c. For simplicity, I removed the truncate table results; after each query, and I don't show how the queries were ran 5 times with the high and low value thrown out. In a real scenario, where you're not just trying to measure select performance, the parallel hint should be moved into the insert statement.

score 0 · Answer 2 · answered Oct 04 '13 at 19:52

If it works more than hour then there're many rows. And it seems that there're too many rows.

You can divide your query execution to some steps and retrieve data by portions. For example divide it by first characters in field1. Get result for users started with 'A', then 'B', etc.
Also you can make it dynamically: write stored procedure that returns <= N (less or equal) rows starting with substring @s for field1 and string @ns - until which substring the data is returned (or null if all data is returned).
It will has varchar(max) @s, int @N, varchar(max) @ns output as parameters and will return data this way:

Count the number of rows which have field1 like @s + '%' for both tables
1. if the number of rows <= @N then return data and output @ns as @s with incremented last character (until z - in this case remove this last character and increment next last character and so on). For example, @s = 'A' ---> @ns = 'B' or @s = 'CKZ' ---> @ns = 'CL'
2. if >= @N then @s --> @s + 'A', e.g. @s = 'KD' ---> @s = 'KDA'
Just execute this stored procedure in a loop starting from @s = '' and after every execution do @s <--- returned @ns until it isn't equal null

If counting step takes too much time then set manually granularity of you filter (i.e. looping from 'A' to 'Z' or from 'AA' to 'ZZ' or from 'AAA' to 'ZZZ' and etc.).
It's just an idea which may help you to doesn't wait an hour until first row for processing is retrieving.

Order the output of union of two queries using shared columns from both tables efficiently

2 Answers2