0

I am selecting a single column of foreign keys from multiple tables through either UNION or UNION ALL.

It is generally recommended to use UNION ALL instead of UNION for performance reasons when duplicates do not matter. However, in my calling PHP script it would be more efficient to loop through and manipulate the data without duplicates.

So, I can use either of the following options:

Option 1:

Use UNION in the database to eliminate duplicates

Option 2:

Use UNION ALL in the database and eliminate the duplicates in my PHP script using array_unique() or a similar function.

My assumption is that Option 1 would be the preferred and more efficient method in the majority of cases. However, I have nothing to back up that assumption, and I am not sure of the best way to test it, especially since the result would likely depend a lot on the data.

Is my assumption correct in most cases? If so, why? If not, why not?

kojow7
  • Test both options and you will see. – Shadow Sep 17 '17 at 04:21
  • The answer is that it depends. Test both to find out. – Tim Biegeleisen Sep 17 '17 at 04:24
  • @TimBiegeleisen You linked this as a duplicate, however, there is nothing on that page that answers my question. It just reiterates what I already stated in my question. – kojow7 Sep 17 '17 at 04:54
  • @kojow7 I disagree and if you read all the comments under the two most highly voted answers you will get the answer you are seeking. – Tim Biegeleisen Sep 17 '17 at 04:56
  • @TimBiegeleisen Actually, if you read all the comments under the two most highly voted answers, you will see that I posted a comment there about an hour ago. I did read the comments on that post and a number of others, and could not find a satisfactory answer to my question. Hence why I am asking one here. – kojow7 Sep 17 '17 at 05:02
  • I still think this runs the risk of being closed. Without knowing and testing your exact data, either `UNION` or `UNION ALL` could be faster. As a rule of thumb, you probably want to let the database remove duplicates as it was designed to do this (your PHP code was not). – Tim Biegeleisen Sep 17 '17 at 05:07
  • @TimBiegeleisen I do agree, but part of the problem is that I currently only have test data, and even if I had real data, next year the data could be completely different and change the results of any test. I was, however, hoping that someone had some generalized insight into the benefits of eliminating duplicates in one system over another. But perhaps no generalized insight exists. – kojow7 Sep 17 '17 at 05:14
  • The general insight is to use `UNION` and allow the database to do its job. But then again that also depends on your data. – Tim Biegeleisen Sep 17 '17 at 05:29
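
For anyone who wants to follow the "test both" advice, here is a minimal timing sketch. It is only an illustration: SQLite (through Python's `sqlite3`) stands in for MySQL, Python stands in for PHP, and the table names, column name, and row counts are made up. Timings from it say nothing definite about a real MySQL/PHP setup, but the same harness shape can be pointed at real data.

```python
import sqlite3
import time

TABLES = ("orders", "invoices", "tickets")  # hypothetical table names


def build_db(rows_per_table=50_000, distinct_keys=1_000):
    """Create an in-memory database whose tables share many duplicate keys."""
    conn = sqlite3.connect(":memory:")
    for name in TABLES:
        conn.execute(f"CREATE TABLE {name} (customer_id INTEGER)")
        conn.executemany(
            f"INSERT INTO {name} (customer_id) VALUES (?)",
            ((i % distinct_keys,) for i in range(rows_per_table)),
        )
    conn.commit()
    return conn


def compound_query(operator):
    """Join one SELECT per table with the given operator (UNION or UNION ALL)."""
    return f" {operator} ".join(f"SELECT customer_id FROM {t}" for t in TABLES)


def dedup_in_database(conn):
    """Option 1: the database removes duplicates."""
    return [row[0] for row in conn.execute(compound_query("UNION"))]


def dedup_in_application(conn):
    """Option 2: fetch everything, then de-duplicate in application code."""
    rows = [row[0] for row in conn.execute(compound_query("UNION ALL"))]
    # dict.fromkeys keeps the first occurrence of each key, as array_unique() does
    return list(dict.fromkeys(rows))


def timed(fn, conn):
    """Return the function's result and the wall-clock seconds it took."""
    start = time.perf_counter()
    result = fn(conn)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    conn = build_db()
    for fn in (dedup_in_database, dedup_in_application):
        result, seconds = timed(fn, conn)
        print(f"{fn.__name__}: {len(result)} keys in {seconds:.4f}s")
```

Varying `rows_per_table` and `distinct_keys` gives a rough feel for how the duplicate ratio shifts the balance, which is the part that depends on the data.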

2 Answers

5

The main point is that UNION is shorthand for UNION DISTINCT, so the performance difference between UNION and UNION ALL comes from the need to produce a distinct result. For that task, the database engine and its query optimizer are almost certainly more effective and more efficient than a filtering algorithm written in PHP in the application.

The DISTINCT operation can also benefit from the optimizations the engine already has for GROUP BY.

In addition, duplicate filtering generally works on ordered data, whereas a SQL SELECT returns rows in no particular order unless you ask for one. Filtering in the application therefore means doing that extra work yourself, which can make the overall operation less efficient and slower.

In general, the database engine is much more efficient than application-level PHP code, so Option 1 is usually the better choice.
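
To make the two options concrete, here is a small sketch showing that both end up with the same set of keys. It is an illustration under assumptions, not the asker's code: SQLite through Python's `sqlite3` stands in for MySQL and PHP, and the `orders` and `invoices` tables and the `customer_id` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (customer_id INTEGER);  -- hypothetical tables
    CREATE TABLE invoices (customer_id INTEGER);
    INSERT INTO orders   VALUES (1), (2), (2), (3);
    INSERT INTO invoices VALUES (2), (3), (4);
""")

# Option 1: UNION, so the database returns each key once.
option_1 = [row[0] for row in conn.execute(
    "SELECT customer_id FROM orders UNION SELECT customer_id FROM invoices"
)]

# Option 2: UNION ALL returns every row, duplicates included...
all_rows = [row[0] for row in conn.execute(
    "SELECT customer_id FROM orders UNION ALL SELECT customer_id FROM invoices"
)]
# ...and the application removes them, keeping the first occurrence of each
# value (the same behaviour PHP's array_unique() has).
option_2 = list(dict.fromkeys(all_rows))

print(len(all_rows), sorted(option_1), sorted(option_2))
```

The results are sorted before comparing because neither query specifies an ORDER BY, so the row order is not guaranteed.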

ScaisEdge
1

Speed-wise, the difference is relatively insignificant. The work of running all the SELECTs outweighs the de-duplication, whichever way you do it.

Therefore, I recommend saying UNION DISTINCT, since that is fewer keystrokes for you than array_unique(...).

Other considerations:

  • UNION ALL would shovel more rows from the server to the client; in extreme cases, or with a distant server, this could be a factor in performance.
  • If you are also saying ORDER BY on the UNION, you may as well do the DISTINCT, too.
  • GROUP BY (on the UNION) has the effect of DISTINCT.
  • If you are talking about millions of rows, keep in mind that PHP can hit memory limits on arrays, whereas MySQL is essentially unlimited.
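
A short sketch of the first three points, again with SQLite via Python's `sqlite3` standing in for MySQL, and with hypothetical table and column names. It counts the rows each variant hands to the client and shows GROUP BY and ORDER BY applied to the union.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (customer_id INTEGER);  -- hypothetical tables
    CREATE TABLE invoices (customer_id INTEGER);
    INSERT INTO orders   VALUES (1), (2), (2), (3);
    INSERT INTO invoices VALUES (2), (3), (4);
""")

base = "SELECT customer_id FROM orders {op} SELECT customer_id FROM invoices"

# Rows shipped to the client by each variant.
union_all_rows = conn.execute(base.format(op="UNION ALL")).fetchall()
union_rows = conn.execute(base.format(op="UNION")).fetchall()

# GROUP BY over the UNION ALL result yields one row per key, like DISTINCT.
grouped_rows = conn.execute(
    "SELECT customer_id FROM (" + base.format(op="UNION ALL") + ") AS u "
    "GROUP BY customer_id"
).fetchall()

# ORDER BY on the compound query sorts the already de-duplicated result.
ordered_rows = conn.execute(
    base.format(op="UNION") + " ORDER BY customer_id"
).fetchall()

print(len(union_all_rows), len(union_rows), len(grouped_rows), ordered_rows)
```

With this toy data the client receives 7 rows from UNION ALL but only 4 from UNION; the gap grows with the duplicate ratio.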
Rick James
  • Can you explain the difference between UNION and UNION DISTINCT? I thought UNION automatically did a DISTINCT. – kojow7 Sep 17 '17 at 15:05
  • `UNION` is _identical_ to `UNION DISTINCT`. I recommend always spelling it out so that you (1) think about which to do, and (2) you tell the next reader of the SQL (which might be you in 6 months) that you made a conscious decision. – Rick James Sep 17 '17 at 17:26