0

I'm working on someone else's project. There is a query like this:

SELECT     posts.id, posts.title, posts.body, posts.keywords 
FROM       posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE      tags.name IN ( :keywords )
GROUP BY   posts.id

The new policy is to replace IN with =. So the query I've written looks like this:

SELECT     posts.id, posts.title, posts.body, posts.keywords 
FROM       posts
INNER JOIN pivot ON pivot.post_id = posts.id
INNER JOIN tags ON tags.id = pivot.tag_id
WHERE      tags.name = :keyword
GROUP BY   posts.id

Now I want to know, is GROUP BY redundant in this case? I say so because I think the reason of GROUP BY is omitting duplicate posts which are matched by each keyword.

Martin AJ
  • 6,261
  • 8
  • 53
  • 111
  • Group by, is for matching like records. It is totally different from the IN command. – Chuck Jun 29 '17 at 17:08
  • 1
    `GROUP BY` is used incorrectly in both of these and may deduplicate records as a side effect of incorrect usage. It's intended for `MIN(), MAX(), COUNT(), SUM() etc` aggregates. To deduplicate these, use `SELECT DISTINCT`. – Michael Berkowski Jun 29 '17 at 17:08
  • 2
    Group by isn't simply redundant; it's entirely inappropriate in both instances – Strawberry Jun 29 '17 at 17:09
  • @Strawberry https://stackoverflow.com/questions/44470003/isnt-using-unnormalized-design-better-when-there-are-multiple-joins#comment75935964_44470003 – Martin AJ Jun 29 '17 at 17:12
  • @MichaelBerkowski Are you sure isn't `GROUP BY` redundant in my second query? I mean there won't be any duplicate post in the result. Since it's `=`, not `IN` – Martin AJ Jun 29 '17 at 17:13
  • @MartinAJ The two queries are functionally equivalent. There's no real difference between using `=` or using `IN ()` to compare against a single value. The `IN () ` was not being used to compare against a list of values. But my point is (and @Strawberry too) that this is not how `GROUP BY` is supposed to be used at all for either query. – Michael Berkowski Jun 29 '17 at 17:16
  • Just to be clear: you can replace `IN` with `=` only when the list used consists of a single item. – PM 77-1 Jun 29 '17 at 17:18
  • @PM77-1 I din't get your point honestly. – Martin AJ Jun 29 '17 at 17:20
  • Parameter of `IN` is a **list**. If this list consists of more than one item you will not be able to simply use `=` instead. – PM 77-1 Jun 29 '17 at 22:21

1 Answers1

1

First things first, when using GROUP BY within a SELECT statement every column that is not included within the grouping clause should be wrapped up with an aggregate function.

Just because MySQL allows this kind of odd behaviour doesn't make it best practices. Other DBMS for example PostgreSQL wouldn't allow this query to execute at all.

Saying that, how it works internally within MySQL is just that you get a unique record for each posts.id, but random values from potentially different rows for all non-aggregated and non-grouped column.

You should be using DISTINCT from what I can see.


Answer to your question

Replacing IN with = doesn't affect grouping at all, so you are free to go with it especially if you are not passing list but a single value to that query, but GROUP BY is not redundant in any case (or should be completely removed in both). It would change the output you receive.

If you, for instance, grouped by a unique column within a table and join that to a table with 1:1 relationship GROUP BY would be redundant. As a second example constructing proper WHERE clause with conditions might make it redundant as well.

Kamil Gosciminski
  • 16,547
  • 8
  • 49
  • 72
  • Well I think the result will be always unique. I don't know why you think there may be duplicate posts matched. [Here](https://stackoverflow.com/questions/44444902/how-can-i-select-all-posts-which-have-specific-tags) is my table structure. Still do you believe is `GROUP BY` needed? – Martin AJ Jun 29 '17 at 17:18
  • 1
    As I've said: _"or should be completely removed in both"_. I'm not familiar with primary keys and constraints that you have enforced on your tables. If `posts(id)` is unique and `tags(id)` is unique and `pivot(post_id, tag_id)` is unique then you don't need `GROUP BY` in **BOTH** queries. – Kamil Gosciminski Jun 29 '17 at 17:25
  • Got it .. Just for your information, I guess `GROUP BY` **is** needed in the first query. Since it uses `IN` clause and a post may be matched multiple times for each different tag. Assume a post has both `mysql` and `sql` tags. So that post will be matched twice in this condition `IN('mysql', 'SQL')`, one time for `mysql` and one time for `sql`. So `GROUP BY` is needed when the condition is `IN`. – Martin AJ Jun 29 '17 at 17:32
  • You should get it that it's plain wrong and an anti-pattern to use `GROUP BY` like that in MySQL. For such cases use `DISTINCT` which is a short keyword for `GROUP BY `. Try to write the same query in SQL Fiddle for any other DBMS than MySQL. It will fail. – Kamil Gosciminski Jun 29 '17 at 17:34
  • As far as tags go, I understood that you had the first query with `IN` operator which anyways supported only one tag at once (one value was passed) and which is why you want to change it to `=`. If the case was different, then sure, it would return duplicate rows unless done what I've written in my comment before. Please also stop for a minute on the first part of my answer, this should make it clear. – Kamil Gosciminski Jun 29 '17 at 17:36