Efficiently Join Two Tables with Several WHEREs

Question

I meet a problem when I try to join two large tables and select entries with WHERE...OR clause:

SELECT A.a, B.b 
FROM A JOIN B 
ON A.equal = B.equal 
WHERE A.condition1 > 100 OR B.condition2 > 200

Here A and B are very very large tables, and the entries need to match the WHERE...OR conditions in different tables. May I ask if there are any approaches I can optimize this clause?

Thx in advance.

Does separate WHERE conditions selectivity is high (by each separate table)? If true then try to apply the condition in the subquery. — Akina, Jun 01 '20 at 04:38
You might find this helpful: https://stackoverflow.com/questions/13750475/sql-performance-union-vs-or — Nick, Jun 01 '20 at 04:40
Or see https://meta.stackoverflow.com/questions/333952/why-should-i-provide-a-minimal-reproducible-example-for-a-very-simple-sql-query — Strawberry, Jun 01 '20 at 07:09
@SGM . . . I am guessing that you are using Hive, so I removed the MySQL label (it doesn't actually matter for the answer, except for indexes). You should tag the database appropriately. — Gordon Linoff, Jun 01 '20 at 12:08

score 0 · Answer 1 · answered Jun 01 '20 at 04:38

OR conditions are usually performance killer.

If your query is running slow, one alternative approach that you can try is to union two subqueries, like:

SELECT A.a, B.b 
FROM A JOIN B ON A.equal = B.equal 
WHERE A.condition1 > 100
UNION
SELECT A.a, B.b 
FROM A JOIN B ON A.equal = B.equal 
WHERE B.condition2 > 100

You want indexes on A(condition1, equal) and B(condition2, equal) for performance.

If you can guarantee that there is no overlap between the resultset of the subqueries (ie there is no row where both conditions are true at the same time), then you can change UNION to UNIN ALL: it is far more efficient, since the database does not need to look for duplicates across the resultsets.

Gordon Linoff · Answer 2 · 2020-06-01T12:10:04.940

This is a variation on GMB's answer. But because you care about performance, it is worth highlighting. You want to use UNION ALL with appropriate indexes:

SELECT A.a, B.b 
FROM A JOIN B ON A.equal = B.equal 
WHERE A.condition1 > 100
UNION ALL
SELECT A.a, B.b 
FROM A JOIN B ON A.equal = B.equal 
WHERE B.condition2 > 100 AND A.condition1 <= 100;

The use of UNION ALL removes a step to remove duplicates.

The above assumes that A.condition1 is never NULL. That is easily handled but just complicates the query a little.

In Hive, this will only really improve performance if one or both of the conditions are quite selective -- by reducing the number of rows that are matched in the JOIN. If each selects, say, 90% of the rows, then your version is probably pretty similar.

If your database can make use of indexes, then you want:

A(condition1, equal, a)
B(equal, b)
B(condition2, equal, b)
A(equal, a, condition1)

score 0 · Answer 3 · answered Jun 01 '20 at 18:20

Study the execution logs. If it is a skew (single reducer runs slow), see this answer. And if there is no skew, then

try to increase parallelism:

Example settings for Mappers on Tez:

set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat;
set tez.grouping.max-size=32000000;
set tez.grouping.min-size=32000;

Example settings for Mappers if you decide to run on MR instead of Tez:

set mapreduce.input.fileinputformat.split.minsize=32000; 
set mapreduce.input.fileinputformat.split.maxsize=32000000;

--example settings for reducers:

set hive.exec.reducers.bytes.per.reducer=32000000; --decrease this to increase the number of reducers, increase to reduce parallelism

Play with these settings. Success criteria is more mappers/reducers and your map and reduce stages are running faster.

See also: https://stackoverflow.com/a/48487306/2700344

@GordonLinoff recipe will work fine if WHERE conditions are selective and smaller filtered dataset fit in memory(before join) and map-join is enabled: set hive.auto.convert.join=true;

Efficiently Join Two Tables with Several WHEREs

3 Answers3