How to filter out duplicates from a union query?

Question

I have two similar SELECT queries that retrieve data from the same table "my_table".

-- 1st select
SELECT 
    my_table.id,
    a,
    b
FROM my_table
JOIN table2 ON u = v
JOIN table3 ON x = y

UNION ALL

-- 2st select
SELECT 
    my_table.id,
    a,
    b
FROM my_table
JOIN table2 ON r = s
JOIN table3 ON t = u

Duplicates are to be filtered out under the following conditions: If the second select returns an id that is already present in the 1st select, it should be discarded.

Is there an easy solution without using a common table expression?

Note: The SQL does not have to be a UNION and can also be changed.

Bill Karwin · Answer 1 · 2022-01-30T21:30:26.603

UNION filters out duplicate rows by default. UNION ALL does not remove duplicates.

But the duplicates are based on all columns being identical, not just the id column. If a given id value occurs in both queries, but any of the other two columns are different, then it counts as a distinct row.

If you want to reduce the result to a single row per id, the use a GROUP BY:

SELECT id, ...aggregate expressions... 
FROM (
  SELECT my_table.id, a, b ...
  UNION
  SELECT my_table.id, a, b ...
) AS t
GROUP BY id;

When you GROUP BY id, then any other expressions of the outer select-list must be in aggregate functions like MAX() or SUM(), etc.

The reason it is important to use an aggregate function is that when there are multiple rows with the same id value which you want to reduce to one row, what value should be displayed for a and b?

Example:

id	a	b
4	12	24
4	18	28

If you group by id, you would get one row for id=4, but what value for the other two columns?

id	a	b
4	?	?

Read https://dev.mysql.com/doc/refman/8.0/en/group-by-handling.html for more details on this. Or my answer to Reason for Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause

You must use an aggregate function, which includes GROUP_CONCAT() to append all the values from that column in a comma-separated list. Or you can use ANY_VALUE() which picks one of the values from that column arbitrarily.

How do I SELECT id, a, b in the outer select without using aggregate functions? Because I do not need to use any of these functions. — smolo, Jan 30 '22 at 20:17
"but what value for the other two columns?" If the 1st and 2nd selects return a row with the same id, the row of the 2nd select should be discarded. Hence, the a, b values of the 1st select are shown. — smolo, Jan 31 '22 at 05:44

score 0 · Accepted Answer · answered Jan 31 '22 at 15:40

I think this should do it:

-- 1st select
SELECT 
    my_table.id,
    a,
    b
FROM my_table
JOIN table2 ON u = v
JOIN table3 ON x = y
WHERE id NOT IN (
    SELECT 
        my_table.id,
    FROM my_table
    JOIN table2 ON r = s
    JOIN table3 ON t = u
)

UNION ALL

-- 2st select
SELECT 
    my_table.id,
    a,
    b
FROM my_table
JOIN table2 ON r = s
JOIN table3 ON t = u

How to filter out duplicates from a union query?

2 Answers2