SQLServer count() over() with distinct

Question

I'm working on a project, where we need to count the number of distinct rows. A simplified version of the scenario includes a user table, a keyword table and a keyword_user table.

The user table just includes common user meta data, like name etc. The other tables are listed below.

keyword_user:

id
user_id
keyword_id

keyword:

id,
description

What I want to do, is to find a max number of users (5), based on the users keyword_id's, while also counting the total number of matching rows. The count must be distinct.

The query:

SELECT TOP 5 u.[id], 
             u.[firstname], 
             u.[lastname], 
             total = Count(*) OVER() 
FROM   [user] u 
       INNER JOIN [keyword_user] ku 
               ON u.[id] = ku.[user_id] 
WHERE  ( ku.keyword_id IN ( '5f6501ec-0a71-4067-a21d-3c5f87a76411', 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')) 
       AND u.id NOT IN ( '12db3001-b3b9-4626-8a02-2519102cb53a' )

Resultset:

+--------------------------------------+-----------+----------+-------+
|                  id                  | firstname | lastname | total |
+--------------------------------------+-----------+----------+-------+
| F0527AC3-747A-45A6-9CF9-B1F6C7F548F8 | Kasper    | Thomsen  |     3 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     3 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     3 |
+--------------------------------------+-----------+----------+-------+

PROBLEM:

The problem being here, that Michael is counted twice, and the total count therefore is 3, when i want it to be 2. When using count() over() you cannot parse an expression into it, that contains distinct. Also, if I just SELECT DISTINCT, my resultset looks fine, besides the total count, which would still be 3.

If I need to include more information to support the question, please let me know, and I will try to answer the best I can.

MSSQL CREATE DB SCRIPT (SAMPLE DATA)

example_data.sql

wanted resultset:

+--------------------------------------+-----------+----------+-------+
|                  id                  | firstname | lastname | total |
+--------------------------------------+-----------+----------+-------+
| F0527AC3-747A-45A6-9CF9-B1F6C7F548F8 | Kasper    | Thomsen  |     2 |
| 95988F6D-9C91-4779-B6C3-3D4B4D6AE836 | Michael   | Jacobsen |     2 |
+--------------------------------------+-----------+----------+-------+

Also please notice that i have added a script with sample data — Teilmann, Oct 16 '15 at 08:39
@ThomasTeilmann, please create [SqlFiddle](http://sqlfiddle.com) with the schema, sample data and expected result instead of posting a link to dropbox. At the moment the expected result set that you posted in the question doesn't match the sample data. — Vladimir Baranov, Oct 16 '15 at 09:44
@VladimirBaranov if you take the provided query and run it against the database, created with the script i have linked, it produces an almost identical resultset, with only the users last names being different. Ofc. the user id's are also different, seeing as they are randomly generated.. — Teilmann, Oct 16 '15 at 09:48
@ThomasTeilmann, I don't see `id=F0527AC3-747A-45A6-9CF9-B1F6C7F548F8` or `id=95988F6D-9C91-4779-B6C3-3D4B4D6AE836` in the sample data, that's why I'm saying that they don't match. — Vladimir Baranov, Oct 16 '15 at 09:51

ken lacoste · Answer 1 · 2015-10-17T07:45:29.817

1

i'm a little confused in your case specially with the "keywords" and how they're related per user (this is just a process issue for me) thus found myself doing an outer group by containing your initial query as my source table.

Please comment below so we can improve this.

SELECT 
    id
    , firstname
    , lastname
    , total
    , COUNT(*) AS [per_user_count]
FROM (
       SELECT TOP 5 u.[id], 
             u.[firstname], 
             u.[lastname], 
             total = Count(*) OVER() 
       FROM [user] u 
       INNER JOIN [keyword_user] ku 
           ON u.[id] = ku.[user_id] 
       WHERE 
       ( 
           ku.keyword_id IN (
              '5f6501ec-0a71-4067-a21d-3c5f87a76411'
              , 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf'
           )
       ) 
       AND u.id NOT IN ('12db3001-b3b9-4626-8a02-2519102cb53a')
) AS T
GROUP BY
    T.id
    , T.firstname
    , T.lastname
    , T.total

EDIT : We really had a confusion there, then so I created a simpler script that shall exclude the keywords, just the unique users (to generate the total overall) and get the top 5 of it (random order).

    SELECT 
        TOP 5
        T.id
        , T.firstname
        , T.lastname
        , Total = COUNT(*) OVER()
    FROM (
        SELECT DISTINCT
            u.*
        FROM [keyword_user] ku
        LEFT JOIN [user] u
        ON
            ku.user_id = u.id
        WHERE 
        ( 
            ku.keyword_id IN (
                '5f6501ec-0a71-4067-a21d-3c5f87a76411'
                , 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')
        ) 
        AND ku.[user_id] NOT IN ( 
                '12db3001-b3b9-4626-8a02-2519102cb53a' 
            )
    ) AS T

Thanks

EDIT : Your scenario is a straight forward "search of keywords" that is linked to an entity, with an overall count and a top 5 result. As how I understand the CTE (and based in MSDN), CTE is a great solution for hierarchical data mining (no need to do while and do whatever back-flip to get your organizational hierarchy) which doesn't really fit the scenario that we have here.

edited Oct 17 '15 at 07:45

answered Oct 16 '15 at 08:39

ken lacoste

894
8
22

Yes i understand why you are confused by it, since the original database design is quite large. That is why i tried to make a simplified version of it :) I will try your suggestion out. Give me a minute :) – Teilmann Oct 16 '15 at 08:40
Aah you included a count, on each row, showing how many times the user has been selected, on behalf of how many of the keywords he has. Unfortunately, that is not really what im interested in. I want a count that reflects how many users in the database that has any of those keywords, but only a maximum of 5 returned. I know its a little hard to grasp, but it is also hard to explain :) So it is not an individual count, but kind of a global count. And it must be distinct. – Teilmann Oct 16 '15 at 08:49
Actually, im only interested in the total count, but it is still not distinct. (It returns 3 on all the rows, when it should return 2 in this case.) I have added a db create script with some sample data, if you would like a challenge :) – Teilmann Oct 16 '15 at 08:52
distinct KEYWORD per user? or distinct USER PER keyword? – ken lacoste Oct 16 '15 at 08:53
Neither :) I know it's a bit wierd, but what i want is a number, somehow, of how many users (total) in the database that has ANY of those keywords. It could be 77, while still returning 2 rows. Those two rows would still have 77 as total count. Please ask more questions if need be. I'll try to answer :) – Teilmann Oct 16 '15 at 08:55
The resultset cannot contain and must not count the same user twice. I does that, because michael has both keywords, and then counts him twice. – Teilmann Oct 16 '15 at 08:57
So the users should be distinct somehow. Problem is, that if i just add SELECT DISTINCT it doesn't affect the count() over() – Teilmann Oct 16 '15 at 08:57
How can a total keyword reach 77 if I only have about 2 keyword records? I mean i've already tried mutating each of your table and it only reached 18 maximum combinations. try doing (select * from user, keyword, keyword_user) sorry for so many questions. – ken lacoste Oct 16 '15 at 09:02
It was theoretical: It depends of the number of USERS. if the keyword_user table hold 77 rows with a reference to a user and the same keyword, then one keyword is referenced by 77 different users, IE. 77 users has the same keyword. If i query for users that has this keyword, i want to get 5 users, with a total count of 77. Does this make sense? – Teilmann Oct 16 '15 at 09:05
Totally makes sense, but how do you prefer to search for the keyword now, by ID or by description of the keyword? because based on your initial query, you're using keyword_id. – ken lacoste Oct 16 '15 at 09:12
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/92466/discussion-between-thomas-teilmann-and-ken-lacoste). – Teilmann Oct 16 '15 at 09:14

score 1 · Accepted Answer · answered Oct 16 '15 at 10:12

1

You really should explain what you need in the question, not in comments.

In CTE_Users we find all distinct users for the given keywords. Then join the result with user to get user details. At least it produces the result that you expect with the given small sample data.

WITH
CTE_Users
AS
(
    SELECT DISTINCT ku.user_id
    FROM
        keyword_user AS ku
    WHERE
        ku.keyword_id IN (
            '5f6501ec-0a71-4067-a21d-3c5f87a76411', 
            'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')
        AND ku.user_id NOT IN (
            '12db3001-b3b9-4626-8a02-2519102cb53a')
)
SELECT TOP(5)
    u.id
    ,u.firstname
    ,u.lastname
    ,COUNT(*) OVER() AS total
FROM
    user AS u
    INNER JOIN CTE_Users ON CTE_Users.user_id = u.id
;

answered Oct 16 '15 at 10:12

Vladimir Baranov

31,799
5
53
90

Im gonna try this out, but for the record, i think i stated it pretty clear in the question: "What I want to do, is to find a max number of users (5), based on the users keyword_id's, while also counting the total number of matching rows. The count must be distinct.". Your answer seems pretty good, but i cannot test it before an hour or so. Thanks – Teilmann Oct 16 '15 at 10:16
1

This is the problem. You think that you stated it clearly, but only after I read your comment to Nguyễn with example of 77 users I started to understand a little what you are talking about. You should have included these explanations in the question. – Vladimir Baranov Oct 16 '15 at 10:23
I have a question: If i wanted to join on more tables to further filter the users, where should i put it for best practice? (I need to join on the user id (u.id) – Teilmann Oct 16 '15 at 10:55
Err...... It depends... `CTE_Users` should contain the final filtered list of distinct user IDs, so it makes sense to add extra filtering there. – Vladimir Baranov Oct 16 '15 at 11:03
Thanks, your answer was the one that worked best for me, and i have accepted it as the answer. Nice work! – Teilmann Oct 16 '15 at 13:21
Well..hopefully ur users table dont go beyond 32727..thats the max recursion possible for CTE.. – ken lacoste Oct 16 '15 at 18:01
I really doubt that it will get that far, but it might have been worth mentioning it in the answer, before i implemented your solution in the project. ;) – Teilmann Oct 16 '15 at 22:29
@kenlacoste, err... what recursion are you talking about? I admit, I haven't read all comments under your answer, but the query in my answer is not using any recursion. – Vladimir Baranov Oct 16 '15 at 23:51
got it, see my edit. thanks. @VladimirBaranov a recursion is a loop. CTE is designed to build for "hierarchical" query, say you have an organization where the loop starts from the CEO, then splits down to its 2 employees, and that 2 employees has 4, has 8, etc. Loop limit were also made because of the possibility of an infinite tree. To broaden the explanation of an infinite tree, see this stackoverflow problem http://stackoverflow.com/questions/15080922/infinite-loop-cte-with-option-maxrecursion-0 – ken lacoste Oct 17 '15 at 07:44
@kenlacoste, I know how to make recursive CTE and in what cases it is useful and needed. It seems that you think that CTE are used only for recursion, but CTE doesn't have to be recursive. It is just a temporary result set that can be used further in the query. It can be a convenient way to write complex queries that is easy to read. [Using Common Table Expressions](https://technet.microsoft.com/en-us/library/ms190766(v=sql.105).aspx) , [WITH common_table_expression](https://msdn.microsoft.com/en-AU/library/ms175972.aspx) – Vladimir Baranov Oct 17 '15 at 08:12
Oh, I see, you used the CTE as a substitute to a #TEMP without the usage of self-referencing to avoid recursion, clever idea but that surely consumes larger consumption as it's NOT being treated as a table in tempdb. My sub-query in the other hand is an in-line query that cannot be reused, thus will not need much execution plans. See this performance post here. http://sqlblogcasts.com/blogs/tonyrogerson/archive/2008/05/17/non-recursive-common-table-expressions-performance-sucks-1-cte-self-join-cte-sub-query-inline-expansion.aspx – ken lacoste Oct 17 '15 at 08:54
@kenlacoste, I can only recommend you to re-read the post that you linked. It says that SQL Server doesn't materialize CTE, it in-lines CTE. Exactly what you are doing manually with your sub-query. In this answer CTE is just a convenient syntax that makes the query easier to read. If you replace CTE with a sub-query here the resulting execution plan would not change. – Vladimir Baranov Oct 17 '15 at 09:36
hey Thomas, as discussed here, careful with the CTE, no joins inside the CTE ok? it will affect your performance greatly, keep it simple :). @VladimirBaranov hmm, seems its about having anchors / joins INSIDE a CTE and not with those simple ones (like yours). Either way our answer would still yield same results. :) nice time though, thanks! – ken lacoste Oct 17 '15 at 10:04

Joe Taras · Answer 3 · 2015-10-16T09:11:36.307

Try this: I've create two temporary tables (#user and #user_key) and populate them.

create table #user (id int, name varchar(20))

create table #user_key (id int, fk_user int, content varchar(50))

insert into #user values 
(1, 'Giuseppe'),
(2, 'Anna'),
(3, 'Angela'),
(4, 'Maria'),
(5, 'Ethra'),
(6, 'Piero')

insert into #user_key values 
(1, 1, 'ciao'),
(2, 1, 'hello'),
(3, 2, 'hallo'),
(4, 4, 'hullo')

The extract query:

I use as main table #user, so I add a subquery about total count but in the order by clause I try to order users about their keywords. You can add other conditions (as your In / NOT IN)

select top 5 id, name, (select COUNT(*) from #user_key uk)
from #user u
order by (select COUNT(*) from #user_key uk where uk.fk_user = u.id) desc

Go to SqlFiddle

EDIT

You want this?:

97D476C2-B52C-4D44-A460-44472CBF8817    Michael testing 2
F4FE5550-BC69-437E-91A0-5B11E0D9279E    Kasper  Test    2

or this?

97D476C2-B52C-4D44-A460-44472CBF8817    Michael testing 2
F4FE5550-BC69-437E-91A0-5B11E0D9279E    Kasper  Test    2
12DB3001-B3B9-4626-8A02-2519102CB53A    Thomas  Teil    2

Unfortunately this is not going to work as intended. The count will show how many times the user has been selected, i.e. how many of the keywords that have been matched for the single user. Example: If michael has both the keywords included in the Where clause, the count on that row would be 2. I need a total count of all users that matches the two keywords, but only select 5 of them. The counter could for example be 77. — Teilmann, Oct 16 '15 at 08:12
Im sorry, but im having a hard time connecting your suggested answer with my stated scenario :/ It seems that you dont fully understand the problem im facing. It's really hard to explain. Could you take a look at the comments i have wrote for Ken Lacostes answer? — Teilmann, Oct 16 '15 at 08:59

Nguyễn Hải Triều · Answer 4 · 2015-10-16T09:43:57.813

0

You can try:

    SELECT TOP 5 * FROM (
    SELECT
        u.[id], 
        u.[firstname], 
        u.[lastname], 
        total = Count(*) OVER(PARTITION BY ku.keyword_id),
        rownum = ROW_NUMBER() OVER(PARTITION BY ku.keyword_id ORDER BY u.ID)
    FROM   [user] u 
    INNER JOIN [keyword_user] ku 
        ON u.[id] = ku.[user_id] 
    WHERE (ku.keyword_id IN ( '5f6501ec-0a71-4067-a21d-3c5f87a76411', 'c19b95c0-8554-4bbd-9526-db8f1c4f1edf')) 
           AND u.id NOT IN ( '12db3001-b3b9-4626-8a02-2519102cb53a' )
    ) AS A ORDER BY A.rownum DESC

edited Oct 16 '15 at 09:43

answered Oct 16 '15 at 08:14

Nguyễn Hải Triều

1,454
1
8
14

If i put that expression in the over clause, it will only count one up pr. keyword that matches the keywords in the where clause. Ie. Michaels count would be 2 and Kaspers count would be 1. The idea here is to get the total number of users that could possibly match these given keywords, but only selecting max. 5 of them, while still getting the total count. :) – Teilmann Oct 16 '15 at 08:17
Can you give me script of sample data? – Nguyễn Hải Triều Oct 16 '15 at 08:20
yes i can, just give me a couple minutes to create it. – Teilmann Oct 16 '15 at 08:21
There, i have linked to a db script in the bottom of my post. :) – Teilmann Oct 16 '15 at 08:34
And what's the result you want to? – Nguyễn Hải Triều Oct 16 '15 at 09:18
1

I want to select 5 users, that has either of those keywords in the where clause. Each user must have a count that reflects how many users in the db matches these keywords, while still returning 5 rows. Example: there could be 77 users that has chosen the same keyword that we are including in our search. Therefore we should still get a maximum of 5 users, while the count reflects how many total users has this keyword. (77). Please read the comments i have written for Ken Lacostes answer :) – Teilmann Oct 16 '15 at 09:25
Edited. you can try this. – Nguyễn Hải Triều Oct 16 '15 at 09:44
Im sorry, this doesnt work as i want it to either. Im not sure that you actually understand my question – Teilmann Oct 16 '15 at 09:49
Cause I'm bad english, plz show me your result table solutions – Nguyễn Hải Triều Oct 16 '15 at 10:03

SQLServer count() over() with distinct

4 Answers4