Simple SQL query is taking 20 minutes to run?

Question

I have a query that outputs a list of percentages based on a total number, the only part I cant figure out is an efficient method to filter the 'usid' equal to a value on another table.
The query is not failing but is taking a very long time to complete.

    SELECT badge, count(usid)*100 / (SELECT COUNT(DISTINCT usid) from Table1)
    FROM Table1
    WHERE usid IN(
        SELECT usid
        FROM Table2
        WHERE msid = 1
        )
    GROUP BY badge

The output looks something like this

    -----------------------------
    badge        count
    -----------------------------
    1            65.1
    2            45.4
    3            22.7
    4            12.12

The usid that it is counting I am trying to set equal to the usid WHERE msid = 1. Even if this method works it takes far too long. any ideas for a work around?

There are some standard questions: which DBMS, and what is the schema of the tables, what are the sizes (row count) of the table, and (in this case), what indexes are on the relevant columns of the table. — Jonathan Leffler, Aug 21 '14 at 14:52
Using a window function might be faster: `SELECT badge, count(usid)*100 / COUNT(DISTINCT usid) over ()` — , Aug 21 '14 at 15:03

score 0 · Accepted Answer · edited May 23 '17 at 10:33

0

You should be able to use explicit JOIN notation instead of the IN clause:

SELECT a.badge, COUNT(a.usid)*100 / (SELECT COUNT(DISTINCT usid) from Table1)
  FROM Table1 AS a
  JOIN (SELECT DISTINCT usid FROM Table2 WHERE msid = 1) AS b ON a.usid = b.usid
 GROUP BY a.badge

However, I'm not confident that will fix the performance problem. A half-way decent optimizer will realize that the sub-select in the select-list is constant, but you should verify that the optimizer is half-way decent (or better) by looking at the query plan.

I'm not convinced that the COUNT(a.usid) does anything different from COUNT(*) in this context. It would produce a different answer only if a.usid could contain nulls. See also COUNT(*) vs COUNT(1) vs COUNT(pk) — which is better?

edited May 23 '17 at 10:33

Community

1
1

answered Aug 21 '14 at 14:55

Jonathan Leffler

730,956
141
904
1,278

This will not necessarily return the same results, because `table2` might have duplicate rows for any given `usid`. – Gordon Linoff Aug 21 '14 at 14:57
@GordonLinoff: you were right — rewritten to avoid that problem. – Jonathan Leffler Aug 21 '14 at 15:01
Worked like a charm actually, runs in .17 seconds as opposed to what I was getting. Thanks – user3385236 Aug 21 '14 at 15:21

score 0 · Answer 2 · answered Aug 21 '14 at 14:55

This is not such a simple query. Depending on the database you are using, the in might be quite inefficient and each output row is calculating the count(distinct). Try rewriting the query as:

SELECT badge, count(usid)*100 / x.cnt
FROM Table1 t1 cross join
     (SELECT COUNT(DISTINCT usid) as cnt from Table1) x
WHERE exists (select 1
              from table2 t2
              where t2.usid = t1.usid and t2.msid = 1
             )
GROUP BY t1.badge, x.cnt;

This query will probably be faster, regardless of the database you are using.

By the way, it is suspicious that you are calculating count(usid) and then dividing by count(distinct usid). I would expect either both or neither to be count(distinct).

The result im getting is strange. each usid can have all, a few, or none of the badges. So the result shows the % of members who have each badge, based on the total count of usid's — user3385236, Aug 21 '14 at 15:08
@user3385236 . . . And what is strange about your results? This should be returning the same thing as your original query, hopefully with improved performance. — Gordon Linoff, Aug 21 '14 at 15:19

score 0 · Answer 3 · answered Aug 21 '14 at 14:57

0

General rules of thumb on speeding up sql:

only return the minimum fields needed
use paging-- so you supply an offset and limit, and get a page of data
OR, cap returned data at some reasonable cutoff. (you will only see the first 500 results for a search, then the user needs to refine the search parameters) OTHERWISE, someone can run off an open-ended query and put extreme load on the system.
avoid IN statements
avoid nested queries
add indexes on joined/searched fields (in the order they are listed in the query)
Use numbers rather than strings if possible
Avoid joins if not needed (you may also denormalize the database)
If possible precompute information (like sums) and store those in another table or field. These can be updated on insert/update events of related data.

answered Aug 21 '14 at 14:57

Kevin Seifert

3,494
1
18
14

Denormalizing the database is a last resort; absolutely not recommended until you've pursued every other avenue, including redesigning the schema. – Jonathan Leffler Aug 21 '14 at 15:03
I agree... that's why it's at the bottom of the list :) – Kevin Seifert Aug 21 '14 at 15:06
Noted, good advice. I tend to follow most of these but not completely – user3385236 Aug 21 '14 at 15:17

Hlin · Answer 4 · 2014-08-21T15:29:52.477

0

Can you try this:

declare @userIDcnt as int
select @userIDcnt = COUNT(DISTINCT usid) from Table1

SELECT badge, count(t1.usid)*100 / @userIDcnt
FROM Table1 t1
inner join Table2 t2 on t1.usid = t2.usid and t2.msid = 1
GROUP BY badge

edited Aug 21 '14 at 15:29

answered Aug 21 '14 at 15:01

Hlin

134
4

This is what I used originally, just returns "column 'usid' in field is ambiguous. – user3385236 Aug 21 '14 at 15:06
Fix the ambiguity by writing `COUNT(t1.usid)`, as `COUNT(usid)` is the ambiguous reference to `usid`. – Jonathan Leffler Aug 21 '14 at 15:08
Jonathan is right. Also make sure that you have an index on usid field on both tables. – Hlin Aug 21 '14 at 15:14
1

@Hlin: note the critique Gordon Linoff made against the first version of my answer (you'll need to click on the 'edited' timestamp to see the older revisions). The same comment applies to your answer. – Jonathan Leffler Aug 21 '14 at 15:24

Simple SQL query is taking 20 minutes to run?

4 Answers4