SQL Distinct keyword bogs down performance?

Question

I have received a SQL query that makes use of the distinct keyword. When I tried running the query it took at least a minute to join two tables with hundreds of thousands of records and actually return something.

I then took out the distinction and it came back in 0.2 seconds. Does the distinct keyword really make things that bad?

Here's the query:

SELECT DISTINCT
    c.username, o.orderno, o.totalcredits, o.totalrefunds,
    o.recstatus, o.reason 
FROM management.contacts c 
    JOIN management.orders o ON (c.custID = o.custID)
WHERE o.recDate > to_date('2010-01-01', 'YYYY/MM/DD')

Can you post the actual query? In my experience, adding the distinct keyword is often a hack to correct a poorly written query. — Joe Stefanelli, Mar 17 '11 at 15:32
Frankly, I don't actually understand the purpose of the distinct statement, and the guy that wrote it is long gone... — MxLDevs, Mar 17 '11 at 15:47
@Keikoku: Does the query return a different number of rows with/without the distinct? — Joe Stefanelli, Mar 17 '11 at 15:51
I am currently letting it run. It's in the 100 000+ and still going. Anyways what is the purpose of using distinct when there are so many columns selected? I just can't imagine when that would be useful — MxLDevs, Mar 17 '11 at 15:56
@Keikoyu: I suspect the same thing. It seems like `o.orderno` could be enough to make each row unique, but I obviously don't know your data. — Joe Stefanelli, Mar 17 '11 at 16:00
I just asked around and some mentioned that it's possible to have duplicate information, but really, not when I throw in all those columns. If the only reason for using distinct is to return unique results then having two or three columns (date, orderno, AND name) that make it highly impossible for duplicate data to occur should be enough. In fact maybe I should just throw in the column that looks like an index number... Thanks for your help. — MxLDevs, Mar 17 '11 at 16:04

Benoit · Accepted Answer · 2011-03-17T15:39:21.310

18

Yes, as using DISTINCT will (sometimes according to a comment) cause results to be ordered. Sorting hundreds of records takes time.

Try GROUP BY all your columns, it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle I noticed significant performance gain).

edited Mar 17 '11 at 15:39

answered Mar 17 '11 at 15:32

Benoit

76,634
23
210
236

3

Just a small side-note: It won't necessarily be done by ordering the result. The database is free to apply any strategy it thinks is fine to calculate the distinct rows. Oracle can use an index or hashing depending on which columns should be distinct. Both won't result in an sort step. But if sorting is involved, then this could well be quite costly. – Mar 17 '11 at 15:36
@a_horse_with_no_name, (even smaller side not on a side note): hash is an index, too. so granted, no sorting is necessary, but building an index is. – Unreason Mar 17 '11 at 16:11
2

@Unreason: but hashing won't sort the result. That's what I wanted to point out. And an index scan is not necessarly sorted either – Mar 17 '11 at 19:45

score 12 · Answer 2 · answered Mar 17 '11 at 20:06

Distinct always sets off alarm bells to me - it usually signifies a bad table design or a developer who's unsure of themselves. It is used to remove duplicate rows, but if the joins are correct, it should rarely be needed. And yes there is a large cost to using it.

What's the primary key of the orders table? Assuming it's orderno then that should be sufficient to guarantee no duplicates. If it's something else, then you may need to do a bit more with the query, but you should make it a goal to remove those distincts! ;-)

Also you mentioned the query was taking a while to run when you were checking the number of rows - it can often be quicker to wrap the entire query in "select count(*) from ( )" especially if you're getting large quantities of rows returned. Just while you're testing obviously. ;-)

Finally, make sure you have indexed the custID on the orders table (and maybe recDate too).

Yes, I found the reason why distinct was used, and it was because they wanted to join all the orders and clients together but ended up with duplicate orders cause for some reason multiple records were made for a single order containing different information that is irrelevant to this query. But the count tip is nice, it seems like there are 6 million records. — MxLDevs, Mar 18 '11 at 13:26

score 4 · Answer 3 · answered Mar 17 '11 at 16:09

Purpose of DISTINCT is to prune duplicate records from the result set for all the selected columns.

If any of the selected columns is unique after join you can drop DISTINCT.
If you don't know that, but you know that the combination of the values of selected column is unique, you can drop DISTINCT.

Actually, normally, with properly designed databases you rarely need DISTINCT and in those cases that you do it is (?) obvious that you need it. RDBMS however can not leave it to chance and must actually build an indexing structure to establish it.

Normally you find DISTINCT all over the place when people are not sure about JOINs and relationships between tables.

Also, in classes when talking about pure relational databases where the result should be a proper set (with no repeating elements = records) you can find it quite common for people to stick DISTINCT in to guarantee this property for purposes of theoretical correctness. Sometimes this creeps in into production systems.

Is it common practice to put DISTINCT? I took a quick glance at some other queries that were written by the same person and each and every one of them uses DISTINCT, even if he's querying a column that keeps a unique number for every entry that's added to the table. In this case the number is not needed, but maybe I should just toss it in to guarantee uniqueness. — MxLDevs, Mar 17 '11 at 16:34
No it should not be common practice. It kills performance (unless query planner can determine it is superfluous; I don't know how well oracle does that). You should know from the cardinality of your joins, uniqueness of columns, conditions that you apply and the results that you expect if you need it or not. — Unreason, Mar 17 '11 at 16:50

score 0 · Answer 4 · answered Mar 17 '11 at 17:19

You can try to make a group by like this:

  SELECT c.username, 
         o.orderno, 
         o.totalcredits, 
         o.totalrefunds,
         o.recstatus, 
         o.reason
    FROM management.contacts c,
         management.orders o
   WHERE c.custID = o.custID
     AND o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
GROUP BY c.username, 
         o.orderno, 
         o.totalcredits, 
         o.totalrefunds,
         o.recstatus, 
         o.reason

Also verify if you have index on o.recDate

SQL Distinct keyword bogs down performance?

4 Answers4

Linked