COUNT of DISTINCT ids NOT IN subquery?

Question

Ok, I'm trying to get a count of users who bought something in a given month but haven't bought something already in any of the previous months (as in, figure out how many new users are there each month), so this is what I'm kinda trying to do but it's obviously not working:

SELECT COUNT(DISTINCT user_id NOT IN (
    SELECT user_id
    FROM payment
    WHERE amount > 0
    AND MONTH(payment_date) < "10"
    GROUP BY user_id
) AS new_users_count
FROM payment
WHERE amount > 0
AND MONTH(payment_date) >= "10"

If I use "DISTINCT" it returns 0 count.

If I take it out, it returns all the users, new and old.

Plus having a subquery in COUNT() takes a while to process.

Any way this can be done with SUM(IF...)? Or any other more optimal way?

I've looked everywhere for an idea/solution, just can't figure it out.

score 0 · Answer 1 · answered Oct 26 '15 at 09:32

Put the criteria in your WHERE clause. Use NOT EXISTS or NOT IN.

select count(distinct user_id)
from payment
where amount > 0
and month(payment_date) >= 10
and user_id not in
(
  select user_id
  from payment
  where amount > 0
  and month(payment_date) < 10
);

And here is a alternative with conditional aggregation:

select count(*)
from
(
  select user_id
  from payment
  where amount > 0
  group by user_id
  having sum(month(payment_date) >= 10) > 0
  and sum(month(payment_date) < 10) = 0
) t;

score 0 · Answer 2 · edited May 23 '17 at 10:27

Use NOT EXISTS to verify that no payment exists for the same user in the previous month.

SELECT COUNT(DISTINCT tm.USERID)
FROM payment tm
WHERE tm.amount > 0
AND MONTH(tm.payment_date) >= "10"
AND NOT EXISTS
  ( SELECT 'x' 
    FROM payment lm
    WHERE lm.amount > 0
    AND MONTH(lm.payment_date) < "10"
    AND lm.user_id = tm.user_id
  )

PS: Note that this query will not work well anymore once it becomes 2016! Below is a fix for that. It cross joins a calculation to get the first day of the month. This value (DATEPIVOT) is used to get the payment after or before that day. For the calculation, I used the answer by Aleroot.

Another advantage of this notation, is that MySQL should be able to better utilize any indexes on payment.payment_date, so as your data grows larger, this query will suffer less from performance decrease.

SELECT COUNT(DISTINCT tm.USERID)
FROM payment tm
CROSS JOIN 
    (SELECT 
       DATE_SUB(CURRENT_DATE, INTERVAL DAYOFMONTH(CURRENT_DATE)-1 DAY) as DATEPIVOT) d
WHERE tm.amount > 0
AND tm.payment_date >= d.DATEPIVOT
AND NOT EXISTS
  ( SELECT 'x' 
    FROM payment lm
    WHERE lm.amount > 0
    AND lm.payment_date < d.DATEPIVOT
    AND lm.user_id = tm.user_id
  )

jkavalik · Answer 3 · 2015-10-26T09:55:22.257

COUNT(DISTINCT user_id NOT IN (..something..)

is understood as

COUNT(DISTINCT (user_id NOT IN (..something..))

and NOT IN is a boolean operator returning true/false (=1/0)

The condition does not make sense in the COUNT() - you probably wanted to do it in the WHERE clause. Or you might use SUM() and GROUP BY.

But not using subquery at all is usually better (MySQL 5.6+ can handle subqueries well, but yours is "dependent" one). Something like this should work:

SELECT COUNT(DISTINCT p1.user_id)
FROM payment p1
LEFT JOIN payment p2
  ON p1.user_id = p2.user_id
     AND p2.amount > 0
     AND MONTH(p2.payment_date) < 10
WHERE p1.amount > 0
  AND MONTH(p1.payment_date) >= 10
  AND p2.user_id IS NULL;

The left join will join any existing row for the same user_id which ordered something "before", so you just check with p2.user_id IS NULL that no such "previous order" exists for this user_id. Then you count the results - as each user can be returned multiple times depending on the number of payments, the DISTINCT takes care of that.

Some notes

MONTH(datetime) returns number so I removed the quotes around 10
you need an index on user_id to make it work in any reasonable time and preferably composite index on (user_id, amount, payment_date) to turn it into and index scan for the join
unfortunately MONTH(p1.payment_date) >= 10 is not an indexable condition and moreover it will change in effect next year, so you should probably use full date in the comparison, like p1.payment_date >= '2015-10-01' (and the same for p2) and then add an index on (payment_date, user_id, amount) to make the FROM part fast too (only users having at least "new" payment will be checked, old users with no activity will not be rechecked all the time).

score -1 · Answer 4 · answered Oct 26 '15 at 09:35

-1

I am not sure but why don't you use the DISTINCT after the SELECT?

SELECT DISTINCT  COUNT(user_id NOT IN

answered Oct 26 '15 at 09:35

IBuzZBasHI

21
1
4

This will only do the "distinct filtering" after everything is counted, so it will return only one number which is unique by definition. – jkavalik Oct 26 '15 at 09:39

COUNT of DISTINCT ids NOT IN subquery?

4 Answers4