-1

In SQL , I am not able to decide whether to use Union or Union All Using Group By in Union All Eliminates Duplicate records? or is it advisable to use Union with Group By

Error_2646
  • 2,555
  • 1
  • 10
  • 22
Raj
  • 1
  • 1
  • 1
    What database are you using (ex. MySQL, SQL Server, etc..)? The right thing to do in each situation will depend on that. In any case, you can probably just try it both ways and look at the explain plan. – Error_2646 Jun 04 '18 at 17:51
  • 1
    All depend on what you want to do. Show us db schema, sample data, current and expected output. Please read [**How-to-Ask**](http://stackoverflow.com/help/how-to-ask) And here is a great place to [**START**](http://spaghettidba.com/2015/04/24/how-to-post-a-t-sql-question-on-a-public-forum/) to learn how improve your question quality and get better answers. – Juan Carlos Oropeza Jun 04 '18 at 17:53
  • You use what you need/want. If possible (i.e. both return the same result) `union all` should be used because it's not DISTINCT. – dnoeth Jun 04 '18 at 17:54
  • `UNION` **Eliminate** duplicates `UNION ALL` **KEEP** duplicates – Juan Carlos Oropeza Jun 04 '18 at 17:54
  • SQL Server 2016, i want Distinct results. as UNION ALL is Better in Performance than UNION . can i use Group BY and UNION ALL for distinct results ? – Raj Jun 04 '18 at 17:59
  • The elimination of duplicates is the costly thing. So a `UNION ALL` will let you gain nothing if you later eliminate the duplicates anyway. It might be even better to do a necessary elimination as early as possible to keep the set small should it be input to another operation. `UNION ALL` is to be preferred only if you a want duplicates or if you union two sets you know that they are disjunct (and duplicate free), because in that case `UNION` and `UNION ALL` will give the same result but the DBMS won't (needlessly!) search for possible duplicates to eliminate when `UNION ALL`. – sticky bit Jun 04 '18 at 18:11
  • @Raj Here's a reference: https://stackoverflow.com/questions/35627923/union-versus-select-distinct-and-union-all-performance/35628095?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa You really just have to look at the explain plan. My hunch is that you are better off with just UNION, but the explain plan will remove the guess work. Also, if you don't know how to read an execution plan, it's a very valuable thing https://stackoverflow.com/questions/7359702/how-do-i-obtain-a-query-execution-plan?utm_medium=organic&utm_source=google_rich_qa&utm_campaign=google_rich_qa – Error_2646 Jun 04 '18 at 18:15

3 Answers3

0

The answer is a bit more nuanced. It is my understanding that UNION will remove duplicates between your first query and the subsequent one(s) run after the union. However, if your first data set contains duplicates, I believe that those are not removed.

Group By is more used when you are attempting to aggregate some data. For example, with a Sum(), Min() Max(), that type of thing. If you are not doing any aggregation, you won't need a group by(). If you are doing an aggregation within one or more queries within the union, you will need both UNION and Group By.

Lastly, if you want to eliminate ALL duplicates, and you do not have any aggregation in your first query but you are seeing duplicates, you could also throw a distinct on the first query, which would eliminate them from your initial data set as well.

Beth
  • 3
  • 1
  • 3
    `UNION` removes all duplicates, regardless where they come from. – sticky bit Jun 04 '18 at 18:04
  • His question is just about eliminated duplicates when doing a UNION. Sometimes people will GROUP BY all columns without aggregate functions to remove duplicates. Depending on the database and column statistics this can be more efficient. – Error_2646 Jun 04 '18 at 18:05
  • 1
    @Error_2646: but the claim that UNION does not remove duplicates that occur inside one of the queries is still wrong. And for no modern DBMS a `GROUP BY` over all columns is more efficient than a `DISTINCT` (or a union) –  Jun 04 '18 at 18:30
  • 1
    @a_horse_with_no_name I don't think I made that claim. Also, I've only been deep enough in the weeds with Teradata to speak to it, but there are absolutely cases where group by over all columns is faster than distinct, namely when the majority of records are duplicated. – Error_2646 Jun 04 '18 at 18:34
  • @Error_2646: Regarding Teradata, this choice is usually done by the optimizer, no matter if you write DISTINCT or GROUP BY. – dnoeth Jun 04 '18 at 20:08
  • @dnoeth Thanks for the correction. I'll definitely defer to you on that. – Error_2646 Jun 04 '18 at 20:19
0

The GROUP BY will eliminate duplicates in each subquery, but UNION will eliminate duplicates among the subqueries. For example, this will only give unique names in both the ClientList1 and ClientList2 tables:

SELECT [Name] FROM ClientList1 GROUP BY [Name]
UNION
SELECT [Name] FROM ClientList2 GROUP BY [Name]

But this will include names that exist in BOTH tables:

SELECT [Name] FROM ClientList1 GROUP BY [Name]
UNION ALL
SELECT [Name] FROM ClientList2 GROUP BY [Name]
Russell Fox
  • 5,273
  • 1
  • 24
  • 28
0

The basis of choosing between UNION and UNION ALL is that UNION takes longer but can remove some duplicates. If you expect you will not have duplicates that UNION will remove or it is OK to have those duplicates in the output, use UNION ALL

Marlin Pierce
  • 9,931
  • 4
  • 30
  • 52