using Union and Order By without getting duplicates

Question

I'm using an UNION to merge the results of an exact and a fuzzy search. I want the exact matches to be at the top, and the other results sorted by a column. I found this solution which works fine with UNION ALL, but through adding the rank column I lose the property of UNION (without all) to remove the exact match duplicates from the result set.

Is there an elegant way to solve this, or do I have to manually remove the duplicates?

For reference, my simplified query:

SELECT 1 AS [Rank], [CallerID] 
FROM [PHONE]
WHERE [CallerID] = '12345'

UNION

SELECT 2 AS [Rank], [CallerID] 
FROM [PHONE]
WHERE [CallerID] LIKE '12%' AND ABS(LEN([CallerID]) - LEN('12345')) < 3
ORDER BY [Rank] ASC, [CallerID] ASC

Result could looke like this:

Rank        CallerID
----------- --------------------
1           12345
2           123
2           1233
2           1234
2           12345     <- I don't want this line
2           1236

Remark: setting DISTINCT for my CallerID won't solve the problem, since my real query has more columns. I really only want to remove duplicates between the two result sets i combine throught the UNION.

`ABS(LEN([CallerID]) - LEN('12345')) < 3` this will not filter `12345` — flyingfox, Jul 31 '18 at 10:09
@lucumt `LEN('12345')` is 5, so `ABS(5 - 5) < 3` will be true? — Florian Koch, Jul 31 '18 at 10:41

Damien_The_Unbeliever · Accepted Answer · 2018-07-31T10:24:08.183

Put your existing query inside a CTE (here I'm placing your sample data there instead) and then use a ROW_NUMBER() and a further WHERE to filter the results:

with OriginalQuery as (
select 1 as Rank,  12345 as CallerID union all
select 2 ,123 union all
select 2,1233 union all
select 2,1234 union all
select 2,12345 union all
select 2,1236
), Preferred as (
    select *,ROW_NUMBER() OVER (
        PARTITION BY CallerID /* other columns too? */
        ORDER BY RANK
        ) as rn
    from OriginalQuery
)
select
    *
from
    Preferred
where
    rn = 1
order by Rank,CallerID

As indicated, you may have to add more/adjust columns to the PARTITION if CallerID isn't a key, by itself, for this data.

Of course, if there aren't any duplicates in your underlying data, and the reason you're getting duplicates is just because you're running two searches and combining the results, it's far simpler to do:

SELECT [CallerID] 
FROM [PHONE]
WHERE
    CallerID = '12345' OR
    ([CallerID] LIKE '12%' AND ABS(LEN([CallerID]) - LEN('12345')) < 3)
ORDER BY CASE WHEN CallerID='12345' THEN 0 ELSE 1 END, [CallerID] ASC

Where you combine both searches rather than combining their results and then use CASE to pick out the best matches in the ORDER BY.

score 1 · Answer 2 · answered Jul 31 '18 at 10:15

One variant:

WITH DataSource AS
(
    SELECT 1 AS [Rank], [CallerID] 
    FROM [PHONE]
    WHERE [CallerID] = '12345'
)
SELECT [Rank]
      ,[CallerID]
FROM DataSource

UNION ALL

SELECT 2 AS [Rank], [CallerID] 
FROM [PHONE]
WHERE [CallerID] LIKE '12%' AND ABS(LEN([CallerID]) - LEN('12345')) < 3
    AND [CallerID] NOT IN (SELECT [CallerID] FROM DataSource)
ORDER BY [Rank] ASC, [CallerID] ASC

using Union and Order By without getting duplicates

2 Answers2