Should I use distinct in my queries

Question

Where I am working I have been recently told that using distinct in your queries is a bad sign of a programmer. So I am wondering I guess the only way to not use this function is to use a group by .

It was my understanding that the distinct function works very similarly to a group by except in how its read. A distinct function checks each individual selection criteria vs a group by which does the same thing only done as a whole.

Keep in mind I only do reporting . I do not create/alter the data. So my question is for best practices should I be using distinct or group by. If neither then is there an alternative. Maybe the group by should be used in more complex queries than my non-real example here, but you get the idea. I could not find an answer that really explained why or why not I should use distinct in my queries

select distinct
    spriden_user_id as "ID",
    spriden_last_name as "last",
    spriden_first_name as "first",
    spriden_mi_name as "MI",
    spraddr_street_line1 as "Street",
    spraddr_street_line2 as "Street2",
    spraddr_city as "city",
    spraddr_stat_code as "State",
    spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'

VS

select
    spriden_user_id as "ID",
    spriden_last_name as "last",
    spriden_first_name as "first",
    spriden_mi_name as "MI",
    spraddr_street_line1 as "Street",
    spraddr_street_line2 as "Street2",
    spraddr_city as "city",
    spraddr_stat_code as "State",
    spraddr_zip as "zip"
from spriden, spraddr
where spriden_user_id = spraddr_id
and spraddr_mail_type = 'MA'
group by "ID","last","first","MI","Street","Street2","city","State","zip"

The two are pretty much equivalent. If you are improving your SQL skills, then learn to use explicit `JOIN` clauses. Never use commas in the `from` clause. — Gordon Linoff, Nov 11 '15 at 13:24
Using `DISTINCT` in your query when you intend to get distinct information is a good indication of a programmer using common sense. The funny thing about common sense is that it is not that common. — zedfoxus, Nov 11 '15 at 13:29
As mentioned in some of the answers, `DISTINCT` (and sometimes wide `GROUP BY`s) often gets used when somebody hasn't completely understood the datamodel, and wishes to avoid duplicates - which are usually better dealt with in another fashion (more restrictive joins, pre-aggregation, or whatever). Generally speaking, if whatever you're trying to make distinct is wider than your index keys (>3-5 columns), or the columns aren't on an index, there's often a better way. — Clockwork-Muse, Nov 11 '15 at 14:33

score 25 · Accepted Answer · edited May 23 '17 at 11:53

25

Databases are smart to recognize what you mean. I expect both of your queries to perform equally well. It is important for someone else maintaining your query to know what you meant. If you really meant to retrieve distinct records, use DISTINCT. If your intention was to do aggregation, use GROUP BY

Take a look at this question. There are some nice answers that might help.

edited May 23 '17 at 11:53

Community

1
1

answered Nov 11 '15 at 13:25

zedfoxus

35,121
5
64
63

1

This does give me some additional information on the subject. I wish I had found that previously. Thanks! – Taku_ Nov 11 '15 at 13:31

score 10 · Answer 2 · answered Nov 11 '15 at 13:43

The answer provided by @zedfoxus is useful to understand the context.

However, I don't believe your query should require distinct records if the data is designed correctly.

It appears you are selecting the primary key of table spriden, so all that data should be unique. You're also joining onto the spraddr table; does that table really contain valid duplicate data? Or is there perhaps an additional join criterium that's required to filter out those duplicates?

This is why I get nervous about use of "distinct" - the spraddr table may include additional columns which you should use to filter out data, and "distinct" may be hiding that.

Also, you may be generating a massive result set which needs to be filtered by the "distinct" clause, which can cause performance issues. For instance, if there are 1 million rows in spraddr for each row in spriden, and you should use the "is_current" flag to find the 2 or 3 "real" ones.

Finally, I get nervous when I see "group by" used as a substitute for distinct, not because it's "wrong", but because stylistically, I believe group by should be used for aggregate functions. That's just a personal preference.

This answer is really nice; it takes a step back to think about: why do we have duplicates? — zedfoxus, Nov 11 '15 at 17:31

score 7 · Answer 3 · answered Nov 11 '15 at 13:26

In your example distinct and group by do the same thing. I think your colleagues means that your query should not return duplicates in the first instance and that you should be able to write your query without a distinct or group by clause. You maybe be able to reduce the duplicates by extending your join conditions.

score 5 · Answer 4 · answered Nov 11 '15 at 13:27

5

Ask them why is it a bad practice. A lot of people make up rules or come up with things that they consider bad practice from reading the first page of the book or the first result of a google search. If it does the job and doesn't cause any issues there is no reason to create more work by finding alternatives. From the two options you have posted I would use distinct too because its shorter and easier to read and maintain.

answered Nov 11 '15 at 13:27

DonO

1,030
1
13
27

I think it used to be bad but with the new oracle optimizer it is smart enough to realize what you want. – Aaron Rabinowitz Aug 23 '17 at 18:10
`:thums_up ` But always is important to doubt about own actions, to improve. – candlejack Jan 18 '18 at 02:23
1

It is bad practice because it is a bandage for a database lacking normalization (a bad sign of the database architect). However I generally observe it used by cargo-cultist developers who used it as a silver bullet for some duplicate issue at one point in the far-flung past and now just use it everywhere without consideration (a bad sign of the developer). – Elaskanator Dec 10 '18 at 17:30

Stewart · Answer 5 · 2019-11-18T16:32:45.087

Whoever told you using DISTINCT is a bad sign in itself is wrong. In reality, it all depends on what problem you are trying to solve by using DISTINCT in the first place.

If you're querying a table that is expected to have repeated values of some field or combination of fields, and you're reporting a list of the values or combinations of values (and not performing any aggregations on them), then DISTINCT is the most sensible thing to use. It doesn't really make sense in my mind to use GROUP BY instead just because somebody thinks DISTINCT shouldn't be used. Indeed, I think this is the kind of thing DISTINCT is designed for.

If OTOH you've found that your query has a bug meaning that repeated values are being returned, you shouldn't use either DISTINCT or GROUP BY to cancel out this bug. Rather, you should figure out the cause of the bug and fix it.

Using DISTINCT as a safety net is also a poor practice, as it potentially hides problems, and furthermore it can be computationally expensive (typically O(n log n) or O(n²)). In this scenario, I can't see that using GROUP BY instead would help you.

I see the last two cases as the most prominent reason. "Oh this query is throwing a PK violation, let's just slap a `DISTINCT` on it." — Every junior SQL developer I've ever seen. — Elaskanator, Nov 18 '19 at 15:12
@Elaskanator Just to clarify, when you say "the most prominent reason", do you mean the most prominent purpose for which junior SQL developers (mis)use `DISTINCT`? — Stewart, Nov 18 '19 at 16:31
Yes, the purpose being to shut up the error message or make duplicates go away. — Elaskanator, Nov 18 '19 at 19:40

drew sutherland · Answer 6 · 2020-01-16T10:22:31.980

Yes, Distinct tends to raise a little alarm in my head when I come across it in someones' query. It is required in some cases ofcourse, but most data models should not require it. It tends to be a last resort, or outlier case, for having to use it. It may also be systemic of a bad application sitting ontop of the database, allowing duplicate entries to be inserted or updated to be duplicates (and likewise, no corresponding database level constraints to prevent such actions). So the first thing to check is the data. It could be a sign of bad datamodel design. But most likely the query should not get to that stage in a select where duplicate rows are lingering.

In constructing a large query, normally I would start with the nugget of a subquery which is specifying the unique fields, and any subquery after that must Inner join or Left join onto that but never add or reduce the number of rows already defined by the nugget query.. and remembering to handle the possible NULLs of the left joins.

So for example, the nugget query could select the right rows also by using Partitions to, for example, select the most recent row of a joined table, or to do some other grouping at that stage.

In your example, I would not expect duplicates. If a person can have historical addresses, fine, but then do you need to see all addresses, or only the most recent, and if there were duplicate addresses, for the same person, does that mean incorrectly duplicated data, or does it mean the person left that address but returned to it later... in which case the partition select would fix that with much better control than a distinct.. especially when fields are added to the query by someone else later and breaks the distinct-ness.

This means that all other data hangs off this nugget of a sub query.. you stick the other possible fields onto the right of the core set of fields.

If Distincts are a last resort, then they are typically reserved for cases where the data is known to have duplicate entries in that table for that set of fields, and it's perfectly normal. In my head though a distinct is a slow, post-select process in the plan especially when it's a large result set being returned. I ought to verify that one of these days.

score 0 · Answer 7 · answered Jan 29 '18 at 19:26

Provided your queries are correct, DISTINCT and GROUP BY provide the same result set, but your colleagues are correct in stating that DISTINCT hides problems. If you are missing a join and using a GROUP BY, you'll get back more information than you're expecting. If you are missing a join and using DISTINCT the SQL engine will perform an unbounded (or partially bounded) join, narrow the results down, and then come up with the expected answer.

Beyond the obvious performance degradation of generating more data than is necessary, you also run the risk of filling your tempdb (i.e.: running out of room on the hard drive where your tempdb lives).

Use GROUP BY in production.

That would depend on what you group by and what aggregations you have. If you just remove the `DISTINCT` and `GROUP BY` all output fields as per the OP's example, there will be no change to the behaviour. — Stewart, Mar 29 '19 at 10:21

Should I use distinct in my queries

7 Answers7