0

I am trying to return the first registration for a person based on the minimum registration date and then return full information. The data looks something like this:

Warehouse_ID  SourceID  firstName  lastName  firstProgramSource  firstProgramName  firstProgramCreatedDate  totalPaid  totalRegistrations
12345         1         Max        Smith     League              Kid Hockey        2017-06-06               $100       3
12345         6         Max        Smith     Activity            Figure Skating    2018-09-26               $35        1

The end goal is to return one row per person that looks like this:

Warehouse_ID  SourceID  firstName  lastName  firstProgramSource  firstProgramName  firstProgramCreatedDate  totalPaid  totalRegistrations
12345         1         Max        Smith     League              Kid Hockey        2017-06-06               $135       4

So, this would aggregate the totalPaid and totalRegistrations variables based on the Warehouse_ID and would pull the rest of the information based on the min(firstProgramCreatedDate) specific to the Warehouse_ID.

This will end up in Tableau, so what I've recently tried ignores aggregating totalPaid and totalRegistrations for now (I can get that in another query pretty easily). The query I'm using seems to work, but it is taking forever to run; it seems to be going row by row for >50,000 rows, which is taking forever.

select M.*
from (
select Warehouse_ID, min(FirstProgramCreatedDate) First
from vw_FirstRegistration
group by Warehouse_ID
) B
left join vw_FirstRegistration M on B.Warehouse_ID = M.Warehouse_ID
where B.First in (M.FirstProgramCreatedDate)
order by B.Warehouse_ID

Any advice on how I can achieve my goal without this query taking an hour plus to run?

Kyle
  • 15
  • 3
  • What is the data type of the `firstProgramCreatedDate` column, and is there by any chance an index for `(Warehouse_ID, firstProgramCreatedDate)`? – Joel Coehoorn Mar 18 '19 at 17:26
  • @JoelCoehoorn firstProgramCreatedDate is datetime data type. There is not an index right now, no. Should I go that direction? I wasn't sure about it because every help post I've seen was aiming to use the index to compare a range of dates. – Kyle Mar 18 '19 at 17:39
  • Why are you grouping by Warehouse_ID when you're trying to query a person's MIN firstprogramdate? Also- if instead you choose to group by their name, then it's a problem once you have a second person that applies with the same name. – Zorkolot Mar 18 '19 at 17:46
  • @Zorkolot I'm grouping by Warehouse_ID because it is the unique identifier across source. There are five sources, so I want a person's MIN firstProgramCreatedDate regardless of source. – Kyle Mar 18 '19 at 17:54
  • @Kyle While that explains what Warehouse_ID is... you asked _"I am trying to return the first registration for a person"_. The query you provided does nothing of the sort with respect to a particular person, it groups by Warehouse_ID. – Zorkolot Mar 18 '19 at 18:32
  • @Zorkolot I'm sorry, I believe I have been a little unclear. Warehouse_ID is a unique identifier for each particular person. The query from Eric below works, though, so I believe I have my answer! – Kyle Mar 18 '19 at 18:39

2 Answers2

0

Try to use ROW_NUMBER() with PARTITIYION BY.

For more information please refer to: https://learn.microsoft.com/en-us/sql/t-sql/functions/row-number-transact-sql?view=sql-server-2017

0

A combination of the ROW_NUMBER windowing function, plus the OVER clause on a SUM expression should perform pretty well.

Here's the query:

SELECT TOP (1) WITH TIES
  v.Warehouse_ID
  ,v.SourceID
  ,v.firstName
  ,v.lastName
  ,v.firstProgramSource
  ,v.firstProgramName
  ,v.firstProgramCreatedDate
  ,SUM(v.totalPaid) OVER (PARTITION BY v.Warehouse_ID) AS totalPaid
  ,SUM(v.totalRegistrations) OVER (PARTITION BY v.Warehouse_ID) AS totalRegistrations
FROM
  @vw_FirstRegistration AS v
ORDER BY
  ROW_NUMBER() OVER (PARTITION BY v.Warehouse_ID 
    ORDER BY CASE WHEN v.firstProgramCreatedDate IS NULL THEN 1 ELSE 0 END,
             v.firstProgramCreatedDate)

And here's a Rextester demo: https://rextester.com/GNOB14793

Results (I added another kid...):

+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
| Warehouse_ID | SourceID | firstName | lastName | firstProgramSource | firstProgramName | firstProgramCreatedDate | totalPaid | totalRegistrations |
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+
|        12345 |        1 | Max       | Smith    | League             | Kid Hockey       | 2017-06-06              |    135.00 |                  4 |
|        12346 |        6 | Joe       | Jones    | Activity           | Other Activity   | 2017-09-26              |    125.00 |                  4 |
+--------------+----------+-----------+----------+--------------------+------------------+-------------------------+-----------+--------------------+

EDIT: Changed the ORDER BY based on comments.

Eric Brandt
  • 7,886
  • 3
  • 18
  • 35
  • Yep, this seems to be perfect! Thank you so much! – Kyle Mar 18 '19 at 18:22
  • Actually, I realize I still have one bug with this that I was hitting right before seeing your response. Some people have NULL for their firstProgramCreatedDate at one source, while a date for another. With the ORDER BY as is, it ends up with the NULL registrations being considered the minimum. I'd theoretically not like to simply add **WHERE firstProgramSouce <> 'no registrations'** because I'd like to include people with no registrations as a whole. Thoughts? – Kyle Mar 18 '19 at 18:47
  • Credit to Martin Smith for this answer: https://stackoverflow.com/a/5886975/5790584 – Eric Brandt Mar 18 '19 at 19:03
  • Change the order by thusly: `ORDER BY CASE WHEN v.firstProgramCreatedDate IS NULL THEN 1 ELSE 0 END,v.firstProgramCreatedDate`. Rextester: https://rextester.com/VPH11266 – Eric Brandt Mar 18 '19 at 19:04