0

I'm attempting to create a SQL statement that does the following but obviously failing at it.

Selects all the information from a view and filters it based on the following criteria:

  1. If the given field "customer_number" exists at any point in the table
  2. Then check for any duplicates within the last 30 days
  3. Then output all distinct customer_numbers grouped by customer_number
  4. Have a count so that I can see how many times an item was repeated.

So using today as our current date say I had the following data entries in my table:

customer_number, date
111111, 2013-01-01
111111, 2013-05-05
222222, 2013-01-07
222222, 2013-03-19
333333, 2013-05-15
333333, 2013-05-30

I would want my SELECT statement to return the following:

customer_number, date, count
111111, 2013-01-01, 2
333333, 2013-05-15, 2

And customer 222222 would not be displayed because he does not appear in the last 30 days.

Here's about as far as I've gotten trying to figure this out. It only returns data for the current month.

SELECT *, COUNT(customer_number) 
FROM red_flags 
WHERE dealer_id = '75' AND date BETWEEN CURDATE() - INTERVAL 30 DAY AND CURDATE() 
GROUP BY customer_number 
HAVING COUNT(customer_number) > 1

I've been melting my brain trying to figure out how or if this is even possible to do in a query any help would be greatly appreciated!

bmanhard
  • 149
  • 1
  • 3
  • 12

2 Answers2

1

Something like this should work for you -- it's using a subquery to find out which customers have been flagged in the last 30 days. Then it uses COUNT to get the total and MIN to get the first date.

SELECT customer_number, COUNT(1), MIN(date) minDate
FROM red_flags 
WHERE customer_number IN (
  SELECT customer_number
  FROM red_flags
  WHERE dealer_id = '75' 
      AND date BETWEEN CURDATE() - INTERVAL 30 DAY AND CURDATE() 
  )
GROUP BY customer_number 

And if you only care about those with multiple records, you can add the HAVING clause back.

BTW, be careful using BETWEEN with dates. I prefer to be explicit and use >= and <= for more control.


EDIT -- perhaps the easiest solution, use COUNT with CASE:

SELECT customer_number, COUNT(1), MIN(date) minDate
FROM red_flags 
GROUP BY customer_number 
HAVING COUNT(CASE WHEN date BETWEEN CURDATE() - INTERVAL 30 DAY AND CURDATE() THEN 1 END) > 0 
sgeddes
  • 62,311
  • 6
  • 61
  • 83
  • Seems to be working, it's taking my database ~45 seconds to parse this query though, is there any way to speed it up at all? – bmanhard May 31 '13 at 19:46
  • @bmanhard -- wow, that's a long time. Do you have an index on the customer_id field? – sgeddes May 31 '13 at 19:46
  • I also do only care about items with multiple records and I added the HAVING COUNT(customer_number) > 1 in below the GROUP BY and it's no longer returning records. – bmanhard May 31 '13 at 19:48
  • You should have an index on customer_number and another index on dealer_id + date – Joe Meyer May 31 '13 at 19:49
  • @bmanhard -- adding the HAVING clause should work fine -- see this updated fiddle: http://sqlfiddle.com/#!2/55a3d/6 – sgeddes May 31 '13 at 19:49
  • @sgeddes -- I haven't done anything outside of creating the tables. I'll try googling it to see if I've done that correctly. I assumed indexing was done on table creation. – bmanhard May 31 '13 at 19:50
  • @sgeddes -- Nice, is there a way to get the mindate into the format of yyyy-mm-dd? – bmanhard May 31 '13 at 19:52
  • Mysql is very bad at optimizing WHERE IN. Use a JOIN instead. – Barmar May 31 '13 at 19:53
  • @bmanhard -- More fiddle -- Index creation example and Date_Format example: http://sqlfiddle.com/#!2/66f05/6 – sgeddes May 31 '13 at 19:54
  • @sgeddes -- Nvm I'm dumb and didn't realize your fiddle was set as a datetime. Anyways it seems to work fine on your fiddle but when I add that having cause to my SQL I run on my database it returns nothing. Removing it works fine though. – bmanhard May 31 '13 at 19:55
  • @sgeddes -- Could you recommend me a good read on better understanding Indexes? Should I just be using that same code you gave me on the fiddle on all fields I use for heavy searching? – bmanhard May 31 '13 at 20:02
  • Running the code on the index's also did nothing to speed up the query time still 45 seconds. – bmanhard May 31 '13 at 20:05
  • @bmanhard -- Have you tried using the JOIN syntax Barmar posted? I am unaware of any limitations of performance between IN and JOIN in the later version of MySQL, but perhaps it would work better for your needs. How many records are in your table? Play around with the query to see what helps improve the performance. Try removing the date criteria from your WHERE clause (just for testing). You may need to add an index there as well. Best regards. – sgeddes May 31 '13 at 20:09
  • @sgeddes I did try using the join and it cut it down significantly. I fixed the error with it not returning results my view was messed up. Thanks again for the help! – bmanhard May 31 '13 at 20:30
  • @bmanhard -- no worries, glad you got it working. BTW -- I did just edit my response with what I think is actually the best solution. Not sure why it didn't occur to me earlier -- requires no JOIN/IN at all. Still not sure why MySQL has such bad performance with IN... Best regards! – sgeddes May 31 '13 at 21:47
1

Here's how to do it using a JOIN

select r.customer_number, count(*) ct, min(date) minDate
from red_flags r
join (select distinct customer_number
      from red_flags
      where dealer_id = '75'
      and date between CURDATE() - INTERVAL 30 DAY and CURDATE()) x
ON r.customer_number = x.customer_number
group by customer_number
HAVING ct > 1

SQLFIDDLE

Barmar
  • 741,623
  • 53
  • 500
  • 612
  • Why is MySQL so bad with optimizing IN? I didn't realize this before reading more about it. Thanks for bringing it to my attention. BTW -- I think what I edited might be the best solution. Kind regards! – sgeddes May 31 '13 at 21:57
  • @sgeddes Take a look at EXPLAIN -- it tends to use the wrong index. Instead of selecting items in the subquery and then indexing them in the main table, it scans the entire main table and then indexes into the subquery. – Barmar May 31 '13 at 22:04
  • @Barmar -- yeah, I spent some time reading about it and reviewing different execution plans. Coming from SQL Server where IN is generally better (or equivalent), I didn't realize MySQL's optimizer had these limitations. Seems like JOIN/DISTINCT would be slower than IN by itself. Here's a good SQL Server SO post about MSSQL: http://stackoverflow.com/a/1200337/1073631. Anyhow, thanks again. – sgeddes May 31 '13 at 22:12