How to optimize this query to not fetch duplicate info repeatedly

Question

Say I have two tables, author and book. Each author can have multiple books, identified by the authorId column of the book table which is linked to the id field of author table.

I want to fetch all the books with a certain category Id, and for each book I want to get the author's name and photo url to be displayed next to the book title.

So I do a query like this:

SELECT author.name, author.photoUrl, book.title
    FROM author, book
    WHERE book.categoryId = '3' AND author.id = book.authorId

The problem is, what if an author has multiple books in the category? (E.g a fiction writer writes multiple fiction books). In that case, will the author's info be fetched separately for each row, hence fetching duplicate info, or is there any way such as using DISTINCT so an author's info is only fetched once?

Please show us an example of output vs. the requested output. — Nir Alfasi, Aug 20 '12 at 18:33
@HLGEM Are there any performance issues with it vs other joins? — Ali, Aug 20 '12 at 18:38
o, but it is harder to maintain (espcially if you need to convert to left joins later) and far likelier to have accidental cross joins. This syntax was replaced 20 years ago, would you use C# code that had been replaced that long ago if C# had even be in existance that long ago? — HLGEM, Aug 20 '12 at 18:41
How do you intend to display the result in your application? As parent child relation? — Magnus, Aug 20 '12 at 18:41
@MyItchyChin Showing a list of books in a certain category, and showing the name of author and photo of author next to the book title — Ali, Aug 20 '12 at 18:46
@ClickUpvote But that is exactly what you query does, is it not? — Magnus, Aug 20 '12 at 18:48
@Magnus yes, but since an author can have multiple books in a category, will this query unnecessarily fetch his info repeatedly (i.e fetching duplicates)? (You can submit your reply as an answer) — Ali, Aug 20 '12 at 18:57

score 4 · Answer 1 · answered Aug 20 '12 at 18:34

First, you should use standard join syntax:

SELECT author.name, author.photoUrl, book.title
FROM author join
     book
     on author.id = book.authorId
WHERE book.categoryId = '3'

What you are calling "duplicate" information is exactly what SQL does in a join. There is no issue with this. If you give an example of what you want returned, then there might be a way to reduce the data.

It strikes me that you might want the list of books for an author, as a delimited list. In that case try this:

SELECT author.name, author.photoUrl,
       group_concat(book.title separator ', ') as books
FROM author join
     book
     on author.id = book.authorId
WHERE book.categoryId = '3'
group by author.name, author.photoUrl

Nice. I would use `group by author.id, author.name, author.photoUrl` but still +1 — ypercubeᵀᴹ, Aug 20 '12 at 19:11

Magnus · Accepted Answer · 2012-08-20T19:43:46.667

1

Yes, it would return duplicates on the author but I dont think that is a problem.
An alternative could be to have two queries instead, one for the books and one for the authors, and than in code set up a relationship (in a DataSet or using a Dictionary<int, Author>) between the results on AuthorId.

Books

SELECT 
  book.title,
  book.authorId
FROM 
  book
WHERE 
  book.categoryId = '3'

Authors

SELECT 
  author.Id,
  author.name, 
  author.photoUrl
FROM
  author
WHERE
  EXISTS(
     SELECT NULL
     FROM book
     WHERE 
       book.categoryId = '3' AND
       author.id = book.authorId
  )

edited Aug 20 '12 at 19:43

answered Aug 20 '12 at 19:20

Magnus

45,362
8
80
118

@MyItchyChin actually it is very efficient, it'll create a right semi join between the tables. – Magnus Aug 20 '12 at 19:43
@MyItchyChin using distinct is much worse since it requires a Sort operation on all the columns included in the select – Magnus Aug 20 '12 at 19:48
It depends on the SQL implementation and how it optimizes queries, MS SQL 2k8 generates the same execution plan for either query. That being said older versions of SQL and Oracle do not, the EXISTS was considered to be very inefficient because it executed the sub-query for every row in the query. – MyItchyChin Aug 20 '12 at 19:57
If you're joining on a primary key, which I assumed you would be in this instance, then you're going against ordered indexes which means you'd suffer no impact from using DISTINCT. – MyItchyChin Aug 20 '12 at 19:59
@MyItchyChin no cost only if fk book.authorId is indexed. Maybe pre SQL 2000 had problems with EXISTS, but that was a long time ago. – Magnus Aug 20 '12 at 20:08
It's definitely implementation specific, once upon a time EXISITS was a bad idea. It seems that's no longer the case. – MyItchyChin Aug 20 '12 at 20:25

Nicholas Carey · Answer 3 · 2012-08-20T19:54:42.217

1

Your original query (fixed):

SELECT author.name     ,
       author.photoUrl ,
       book.title
FROM author
join book   on book.categoryId = '3'
               book.authorId   = author.id

does exactly what you asked of it: it will return one row for each book in category 3, along with the author's name and photo URL.

You can break it up into two result sets:

--
-- retrieve category 3 books
-- 
select *
from book b
where b.category = 3

--
-- retrieve related author data
--
select *
from author a
where exists ( select *
               from book b
               where b.authorId = a.id
                 and b.category = 3
             )

It's now up to you to match each book to its author, however you choose to do that. The first result set should carry the primary key for the author; the second the foreign key to the author table.

edited Aug 20 '12 at 19:54

answered Aug 20 '12 at 19:31

Nicholas Carey

71,308
16
93
135

@MyItchyChin: thanks for the edit, but `select distinct` requires sorting and deduping. The `exists` clause in this context is better performing and more clearly states the desired results. – Nicholas Carey Aug 20 '12 at 19:53
Assuming the id columns are indexed there'd be no sort necessary. In hindsight a RIGHT JOIN would alleviate the need for a DISTINCT. In older implementations of MS SQL and Oracle SQL the EXISTS sub-query was executed for every row in the query. I just ran the execution plan for both queries in MS SQL 2k8 and they're both compiled to the same query, a RIGHT SEMI JOIN. So I would say it's implementation specific but historically EXISTS sub-queries were horribly inefficient. – MyItchyChin Aug 20 '12 at 20:04
1

@MyItchyChin: it depends on the implementation in use, its query optimizer, whether the subquery is correlated or non-correlated, how much index coverage is available and what sort of index is used. A blanket statement that the `exists` clause is horribly inefficient is simply false. – Nicholas Carey Aug 20 '12 at 20:08
Yeah, I'm seeing now that modern SQL implementations optimized the queries to almost the same thing. Once upon a time the EXISTS statement caused a sub-query to be rerun for each row. – MyItchyChin Aug 20 '12 at 20:24

McGarnagle · Answer 4 · 2012-08-20T18:47:04.223

-1

You could use group concat:

SELECT author.name, author.photoUrl, GROUP_CONCAT(DISTINCT title ORDER BY title SEPARATOR ',')
FROM author
    INNER JOIN book on author.id = book.authorId
WHERE book.categoryId = '3'
GROUP BY author.name

This takes all the titles for the given author and returns them as a comma-separated string.

edited Aug 20 '12 at 18:47

answered Aug 20 '12 at 18:35

McGarnagle

101,349
31
229
260

3

Shame on you for encouraging the use of implicit joins. We should neot be encouraging people to use poor techniques. – HLGEM Aug 20 '12 at 18:37
2

@HLGEM geez man, have a cow? Why does using implicit joins make a man a million times worse than Hitler? – McGarnagle Aug 20 '12 at 18:39
2

I hardly said that and BTW I am not and never have been a man. But using implicit joins is a recognized sql antipattern and we are supposed to be showing people the CORRECT way to do things. I will always downvote anyone who uses one in an answer. – HLGEM Aug 20 '12 at 18:42
3

@dbaseman: If you want to use MySQL "trick" with `GROUP BY`, then you should `GROUP BY author.id`, not by author.name. There may be authors with same name (but different IDs). – ypercubeᵀᴹ Aug 20 '12 at 18:46
2

@HLGEM I guess I missed the memo-- what exactly is the vendetta against implicit joins? Can you point me to the argument? – McGarnagle Aug 20 '12 at 18:46
2

@dbaseman I think you are over-reacting. Hitler? Really? Do you have any consideration how much more offensive that is than being down-voted a measly 2 rep points? Wow. – Aaron Bertrand Aug 20 '12 at 18:47
2

Here is my perspective on why implicit joins are bad: https://sqlblog.org/blogs/aaron_bertrand/archive/2009/10/08/bad-habits-to-kick-using-old-style-joins.aspx – Aaron Bertrand Aug 20 '12 at 18:48
1

@HLGEM also, what is your definition of "anti-pattern"? Syntax variations that you do not prefer? – McGarnagle Aug 20 '12 at 18:48
5

@dbaseman honestly, this whole thing is erupting because of your response to the comment. hitler isn't really an appropriate comparison to the "shame on you" comment (which i'll agree could have been laced with a bit more tact). having said that, explicit joins are "standard" in ANSI-92 in favor of implicit joins. here's a SO thread from '08 which you may find interesting. http://stackoverflow.com/questions/334201/why-isnt-sql-ansi-92-standard-better-adopted-over-ansi-89 – swasheck Aug 20 '12 at 18:56
1

Im with @dbaseman on this one, the code in the question had implicit joins, and how often do you rewrite code that is not relevant to what is _actually_ being asked? Sometimes you do but I dont think it deserves a down vote if you dont. – Magnus Aug 20 '12 at 18:56
http://www.amazon.com/SQL-Antipatterns-Programming-Pragmatic-Programmers/dp/1934356557/ref=sr_1_1?s=books&ie=UTF8&qid=1345489170&sr=1-1&keywords=sql+antipatterns – HLGEM Aug 20 '12 at 18:59
2

@Magnus, you rewrite if the code they are using is an antipattern which it is. – HLGEM Aug 20 '12 at 19:00

How to optimize this query to not fetch duplicate info repeatedly

4 Answers4