18

I have a table that contains customers information. Each customer is assigned a Customer ID (their SSN) that they retain as they open more accounts. Two customers may be on the same account, each with their own ID. The account numbers are not ordered by date.

I would like to find the most recent account of each customer or group of customers. If two customers have ever been on an account together, I want to return the most recent account either customer has been on.

Here is a sample table with some of the possible cases.

Example table ACCT:

acctnumber  date            Cust1ID     Cust2ID 
10000       '2016-02-01'    1110        NULL    --Case0-customer has only ever had
                                                --one account

10001       '2016-02-01'    1111        NULL    --Case1-one customer has multiple
10050       '2017-02-01'    1111        NULL    --accounts
400050      '2017-06-01'    1111        NULL
10089       '2017-12-08'    1111        NULL

10008       '2016-02-01'    1120        NULL    --Case2-customer has account(s) and later
10038       '2016-04-01'    1120        NULL
10058       '2017-02-03'    1120        1121    --gets account(s) with another customer

10002       '2016-02-01'    1112        NULL    --Case3-customer has account(s) and later
10052       '2017-02-02'    1113        1112    --becomes the second customer on another
10152       '2017-05-02'    1113        1112    --account(s)

10003       '2016-02-02'    1114        1115    --Case4-customer and second customer
7060        '2017-02-04'    1115        1114    --switch which is first and second

10004       '2016-02-02'    1116        1117    --Case5-second customer later gets
10067       '2017-02-05'    1117        NULL    --separate account(s)
10167       '2018-02-05'    1117        NULL

50013       '2016-01-01'    2008        NULL    --Case5b -customer has account(s) & later
50014       '2017-02-02'    2008        2009    --gets account(s) with second customer &
50015       '2017-04-04'    2008        NULL    --later still first customer gets
100015      '2018-05-05'    2008        NULL    --separate account(s)

30005       '2015-02-01'    1118        NULL    --Case6-customer has account(s) 
10005       '2016-02-01'    1118        NULL
10054       '2017-02-02'    1118        1119    --gets account(s) with another
40055       '2017-03-03'    1118        1119
10101       '2017-04-04'    1119        NULL    --who later gets separate account(s)
10201       '2017-05-05'    1119        NULL
30301       '2017-06-06'    1119        NULL
10322       '2018-01-01'    1119        NULL

10007       '2016-02-01'    1122        1123    --Case7-customers play musical chairs
10057       '2017-02-03'    1123        1124
10107       '2017-06-02'    1124        1125

50001       '2016-01-01'    2001        NULL    --Case8a-customers with account(s)
50002       '2017-02-02'    2001        2002    --together each later get separate
50003       '2017-03-03'    2001        NULL    --account(s)
50004       '2017-04-04'    2002        NULL

50005       '2016-01-01'    2003        NULL    --Case8b-customers with account(s)
50006       '2017-02-02'    2003        2004    --together each later get separate
50007       '2017-03-03'    2004        NULL    --account(s)
50008       '2017-04-04'    2003        NULL
50017       '2018-03-03'    2004        NULL
50018       '2018-04-04'    2003        NULL

50009       '2016-01-01'    2005        NULL    --Case9a-customer has account(s) & later
50010       '2017-02-02'    2005        2006    --gets account(s) with a second customer
50011       '2017-03-03'    2005        2007    --& later still gets account(s) with a
                                                --third customer

50109       '2016-01-01'    2015        NULL    --Case9b starts the same as Case9a, but
50110       '2017-02-02'    2015        2016    
50111       '2017-03-03'    2015        2017    
50112       '2017-04-04'    2015        NULL    --after all accounts with other customers
50122       '2017-05-05'    2015        NULL    --are complete, the original primary
                                                --customer begins opening individual
                                                --accounts again

Desired Results:

acctnumber  date            Cust1ID     Cust2ID 
10000       '2016-02-01'    1110        NULL    --Case0    
10089       '2017-12-08'    1111        NULL    --Case1
10058       '2017-02-03'    1120        1121    --Case2
10152       '2017-05-02'    1113        1112    --Case3
7060        '2017-02-04'    1115        1114    --Case4
10167       '2018-02-05'    1117        NULL    --Case5
100015      '2018-05-05'    2008        NULL    --Case5b
10322       '2018-01-01'    1119        NULL    --Case6
10107       '2017-06-02'    1124        1125    --Case7
50003       '2017-03-03'    2001        NULL    --Case8a result 1
50004       '2017-04-04'    2002        NULL    --Case8a result 2
50017       '2018-03-03'    2004        NULL    --Case8b result 1
50018       '2018-04-04'    2003        NULL    --Case8b result 2
50011       '2017-03-03'    2005        2007    --Case9a
50122       '2017-05-05'    2015        NULL    --Case9b

Alternatively, I would accept Case 7 outputting the two separate customer groups:

10007       '2016-02-01'    1122        1123    --Case7 result 1
10107       '2017-06-02'    1124        1125    --Case7 result 2

Because Cases 8a & 8b would represent the company acknowledging the customers are worthy of holding separate accounts, we would want to then consider their group as splitting, so it has separate sets of results.

Also, in most scenarios the customers have many accounts, and mix and matching the above cases overtime is common. For example, a single customer can have five accounts (Case 1), then later opens one or more accounts with another customer (Case 3) sometimes switching the primary account holder (Case 4) then afterwards the first customer begins opening individual accounts again (Case 5b).


I have attempted joining the table to a copy of itself whenever acctnumbers are unique and any of the Cust IDs match. However, this removes customers who have only had one account so I added a union of cust that have no matches on the custid or account number and groups by custid.

Unfortunately, the second piece does not only include custids from case 0 and there are some custids which are excluded all together that shouldn't be.

select
    max(date1) as date,
    cust1id1 as cust1id
from
(
select
    acctnumber as [acctnumber1],
    date as [date1],
    cust1id as [cust1id1],
    cust2id as [cust2id1]
from 
    acct
) t1
join
(
select
    acctnumber as [acctnumber2],
    date as [date2],
    cust1id as [cust1id2],
    cust2id as [cust2id2]
from 
    acct
) t2
on t1.date1 > t2.date2 and
(t1.cust1id1 = t2.cust1id2 or
t1.cust1id1 = t2.cust2id2 or
t1.cust2id1 = t2.cust2id2)
Group by
cust1id1
union
select
    max(date1) as date,
    cust1id1 as cust1id
from
(
select
    acctnumber as [acctnumber1],
    date as [date1],
    cust1id as [cust1id1],
    cust2id as [cust2id1]
from 
    acct
) t1
join
(
select
    acctnumber as [acctnumber2],
    date as [date2],
    cust1id as [cust1id2],
    cust2id as [cust2id2]
from 
    acct
) t2
on (t1.acctnumber1 != t2.acctnumber2 and
t1.cust1id1 != t2.cust1id2 and
t1.cust1id1 != t2.cust2id2 and
t1.cust2id1 != t2.cust2id2)
group by
cust1id1

Update

Thank you for all the great answers and comments so far. I have been trying out the queries and comparing results.

@VladimirBaranov has brought up a rare case that I had not previously considered in comments to other answers.

Similarly to case 7, it will be a bonus if Case8 is handled, but not expected.

Case 9 is important and the result for 9a and 9b should be handled.

Update 2

I noticed issues with my original set of 7 cases.

In more recent accounts, when a customer is no longer on the account, it was always the second borrower that remained. This was entirely unintentional, you can look at any of those examples and either customer can potentially be the remaining customer on the most recent account.

Also, each case had the minimum number of accounts to display exactly what the case was testing, but this is not common. Usually in each step of each case there can be 5, 10, 15 or more accounts before a customer switches to adding on a second customer, and those two can then have many accounts together.

Reviewing the answers I see many have index, create, update and other clauses specific to being able to edit the database. Unfortunately, I am on the consumer side of this database so I have read only access, and the program I can use to interact with the database automatically rejects them.

Golden Ratio
  • 349
  • 2
  • 14
  • 2
    This seems a very bad DB design. – Ankit Bajpai Dec 05 '17 at 21:15
  • 2
    Unfortunately, I am on the consumer end of this database and they have at least one idea right: it's set up as read only. – Golden Ratio Jun 07 '18 at 14:37
  • 1
    The square brackets indicate Microsoft SQL Server syntax. MySQL doesn't use square brackets around identifiers. I'll change the tags for this question to indicate that. – Bill Karwin Jun 19 '18 at 21:35
  • Why doesn't CustID 1116 have a result set? Or 1118? – Jeff Breadner Jun 19 '18 at 22:02
  • Not sure about this db design, do you not support a 3rd customer? However, SQL best practices apply here and that is you need to reformat your data into something legible, even if you are capable of writing a super query to make this work... don't. Views, CTEs, and temp tables come to mind in that order. – RandomUs1r Jun 19 '18 at 22:17
  • I understand why 1116 and 1118 don't have entries now, it's because of the association with 1117 and 1119. Does the SQL result have to be a view / CTE chain, or could it be a stored procedure or function (multiple steps / using temp tables, etc.) – Jeff Breadner Jun 19 '18 at 22:33
  • What version of SQL Server? What format of dates did you use in the question? Is it `DD/MM/YYYY` or `MM/DD/YYY`? – Vladimir Baranov Jun 20 '18 at 02:29
  • @GoldenRatio, Since you are asking for the most efficient solution, what indexes do you have right now on this table? Can you create other indexes? – Vladimir Baranov Jun 20 '18 at 02:39
  • @BillKarwin I apologize for being in the dark of which server is being used. When I first started, no one was creating queries except through a restricted 'Query Designer' and it took a lot of work for the company to admit it the language was SQL. They've had multiple versions of SQL servers through the years, so their column names are inconsistent. Thank you for your help in this matter. – Golden Ratio Jun 20 '18 at 14:01
  • @VladimirBaranov, reading up on indexes, it looks like I've used them before in a temp table (with a as -> any query with create is rejected) using row number() ordey by over(partition by order by row). Also the original table uses Style US 101 mm/dd/yyyy. – Golden Ratio Jun 20 '18 at 14:05
  • @RandomUs1r, The tables are actually split by Branches, one table per branch, this is already a temp table using a union of all the branches - I didn't want to complicate the question as some customers change branches between accounts. As an included difficulty, the account numbers of each branch start with a different identifier (the BranchID) so once joined, the temp table does not have account numbers in a reliable order. And accounts before the branches were hosted by the same server use the same numbering system, so there are duplicate accounts numbers. I will adjust theanswer accordingly – Golden Ratio Jun 20 '18 at 14:05
  • I wrote the original query a long time ago, I forgot that I solved the duplicate account numbers by concatenating the BranchID with the Account Number (although definitely not in ordered by date). also @RandomUs1r, The party I'm programming for is the executives of the company that uses the program. I am the only one who knows any programming language, and I'm self taught on SQL through W3Schools and StackExchange since starting here in Oct 2016. So to all, any chance to learn something new is greatly appreciated. I'm out today and tomorrow so I'll be reviewing answers the first time on Friday – Golden Ratio Jun 20 '18 at 14:17
  • Seems I haven't understood the problem correctly. The last account for customer 1123 is `10057|2.3.17|1123|1124`. Why isn't that record in the results when you want the last account per customer? – Thorsten Kettner Jun 26 '18 at 15:40
  • @ThorstenKettner perhaps a better descriptor for the name of the question would be 'finding each customer group's most recent account'. In the specific Case 7, there are 4 customers that could be considered one group, because each have been on an account with one of the others. So the most recent of all three accounts is what I would prefer, leaving out customer 1122 and 1123 altogether from the final results. The alternate desired result for case 7 was the first and the third account, because none of the customers from the first account were included on the third account. – Golden Ratio Jun 26 '18 at 17:27
  • but I can see now how that alternate desired result is possibly more difficult to attain. – Golden Ratio Jun 26 '18 at 17:28

9 Answers9

2

I'd like to thank Jeff Breadner for the DDL with sample data.

You'll have to run the below query step-by-step, CTE-by-CTE and examine intermediate results to understand what it does. It assumes that AcctNumber is unique in the given table.

At first I want to find the latest account for each individual customer. It is a simple top-n-per-group query and I'm using a ROW_NUMBER approach here.

CTE_Customers makes a plain list of all individual customers by putting together Cust1ID and Cust2ID. CTE_RN assigns them row numbers. CTE_LatestAccounts gives latest account for each individual customer:

+------------------+------------+--------+
| LatestAcctNumber |  LatestDT  | CustID |
+------------------+------------+--------+
|            10000 | 2016-02-01 |   1110 |
|            10050 | 2017-02-01 |   1111 |
|            10052 | 2017-02-02 |   1112 |
|            10052 | 2017-02-02 |   1113 |
|             7060 | 2017-02-04 |   1114 |
|             7060 | 2017-02-04 |   1115 |
|            10004 | 2016-02-02 |   1116 |
|            10067 | 2017-02-05 |   1117 |
|            10054 | 2017-02-03 |   1118 |
|            10101 | 2017-06-02 |   1119 |
|            10058 | 2017-02-03 |   1120 |
|            10058 | 2017-02-03 |   1121 |
|            10007 | 2016-02-01 |   1122 |
|            10057 | 2017-02-03 |   1123 |
|            10107 | 2017-06-02 |   1124 |
|            10107 | 2017-06-02 |   1125 |
+------------------+------------+--------+

The task is complicated by having customer pairs which "propagates" the latest account to another customer.

Customer pairs are defined in the original table, so CTE_MaxLatestAccounts takes each row from the original table and joins latest accounts to it twice - for Cust1D and Cust2ID. For each pair I'm picking one of the two latest accounts - the most recent one. Thus a customer that belongs to a pair may get an account from its partner.

+---------+---------+-------------+---------------------+
| Cust1ID | Cust2ID | MaxLatestDT | MaxLatestAcctNumber |
+---------+---------+-------------+---------------------+
|    1110 | NULL    | 2016-02-01  |               10000 |
|    1111 | NULL    | 2017-02-01  |               10050 |
|    1111 | NULL    | 2017-02-01  |               10050 |
|    1120 | NULL    | 2017-02-03  |               10058 |
|    1120 | 1121    | 2017-02-03  |               10058 |
|    1112 | NULL    | 2017-02-02  |               10052 |
|    1113 | 1112    | 2017-02-02  |               10052 |
|    1114 | 1115    | 2017-02-04  |                7060 |
|    1115 | 1114    | 2017-02-04  |                7060 |
|    1116 | 1117    | 2017-02-05  |               10067 |
|    1117 | NULL    | 2017-02-05  |               10067 |
|    1118 | NULL    | 2017-02-03  |               10054 |
|    1118 | 1119    | 2017-06-02  |               10101 |
|    1119 | NULL    | 2017-06-02  |               10101 |
|    1122 | 1123    | 2017-02-03  |               10057 |
|    1123 | 1124    | 2017-06-02  |               10107 |
|    1124 | 1125    | 2017-06-02  |               10107 |
+---------+---------+-------------+---------------------+

The MaxLatestAcctNumber here is applicable to both Cust1ID and Cust2ID. The same customer may be listed several times here and we need to choose an entry again with the latest account. Here this is the latest account of a pair, not for individual customer.

The approach is the same as in the beginning. Put both Cust1ID and Cust2ID customers in a list: CTE_CustomersWithLatestAccountFromPair. Assign row numbers in CTE_CustomersWithLatestAccountFromPairRN and pick final account in CTE_FinalAccounts.

+---------------------+
| MaxLatestAcctNumber |
+---------------------+
|               10000 |
|               10050 |
|               10052 |
|               10052 |
|                7060 |
|                7060 |
|               10067 |
|               10067 |
|               10101 |
|               10101 |
|               10058 |
|               10058 |
|               10057 |
|               10107 |
|               10107 |
|               10107 |
+---------------------+

Now we just need to filter the original table and leave only those rows (accounts) that appear in this list. See the final result below.

Sample data

declare @ACCT table (
    AcctNumber int,
    dt date,
    Cust1ID int,
    Cust2ID int
);

insert into @ACCT values 
(10000, '2016-02-01', 1110, null),
(10001, '2016-02-01', 1111, null),
(10050, '2017-02-01', 1111, null),
(10008, '2016-02-01', 1120, null),
(10058, '2017-02-03', 1120, 1121),
(10002, '2016-02-01', 1112, null),
(10052, '2017-02-02', 1113, 1112),
(10003, '2016-02-02', 1114, 1115),
(7060,  '2017-02-04', 1115, 1114),
(10004, '2016-02-02', 1116, 1117),
(10067, '2017-02-05', 1117, null),
(10005, '2016-02-01', 1118, null),
(10054, '2017-02-03', 1118, 1119),
(10101, '2017-06-02', 1119, null),
(10007, '2016-02-01', 1122, 1123),
(10057, '2017-02-03', 1123, 1124),
(10107, '2017-06-02', 1124, 1125);

Query

WITH
CTE_Customers
AS
(
    SELECT
        AcctNumber
        ,dt
        ,Cust1ID AS CustID
    FROM @ACCT
    WHERE Cust1ID IS NOT NULL

    UNION ALL

    SELECT
        AcctNumber
        ,dt
        ,Cust2ID AS CustID
    FROM @ACCT
    WHERE Cust2ID IS NOT NULL
)
,CTE_RN
AS
(
    SELECT
        AcctNumber
        ,dt
        ,CustID
        ,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt DESC) AS rn
    FROM CTE_Customers
)
,CTE_LatestAccounts
-- this gives one row per CustID
AS
(
    SELECT
        AcctNumber AS LatestAcctNumber
        ,dt AS LatestDT
        ,CustID
    FROM CTE_RN
    WHERE rn = 1
)
,CTE_MaxLatestAccounts
AS
(
    SELECT
        A.Cust1ID
        ,A.Cust2ID
        ,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
        THEN A1.LatestDT ELSE A2.LatestDT END AS MaxLatestDT
        ,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
        THEN A1.LatestAcctNumber ELSE A2.LatestAcctNumber END AS MaxLatestAcctNumber
    FROM
        @ACCT AS A
        LEFT JOIN CTE_LatestAccounts AS A1 ON A1.CustID = A.Cust1ID
        LEFT JOIN CTE_LatestAccounts AS A2 ON A2.CustID = A.Cust2ID
)
,CTE_CustomersWithLatestAccountFromPair
AS
(
    SELECT
        Cust1ID AS CustID
        ,MaxLatestDT
        ,MaxLatestAcctNumber
    FROM CTE_MaxLatestAccounts
    WHERE Cust1ID IS NOT NULL

    UNION ALL

    SELECT
        Cust2ID AS CustID
        ,MaxLatestDT
        ,MaxLatestAcctNumber
    FROM CTE_MaxLatestAccounts
    WHERE Cust2ID IS NOT NULL
)
,CTE_CustomersWithLatestAccountFromPairRN
AS
(
    SELECT
        CustID
        ,MaxLatestDT
        ,MaxLatestAcctNumber
        ,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY MaxLatestDT DESC) AS rn
    FROM CTE_CustomersWithLatestAccountFromPair
)
,CTE_FinalAccounts
AS
(
    SELECT MaxLatestAcctNumber
    FROM CTE_CustomersWithLatestAccountFromPairRN
    WHERE rn = 1
)
SELECT *
FROM @ACCT AS A
WHERE A.AcctNumber IN (SELECT MaxLatestAcctNumber FROM CTE_FinalAccounts)
;

Result

+------------+------------+---------+---------+
| AcctNumber |     dt     | Cust1ID | Cust2ID |
+------------+------------+---------+---------+
|      10000 | 2016-02-01 |    1110 | NULL    |
|      10050 | 2017-02-01 |    1111 | NULL    |
|      10058 | 2017-02-03 |    1120 | 1121    |
|      10052 | 2017-02-02 |    1113 | 1112    |
|       7060 | 2017-02-04 |    1115 | 1114    |
|      10067 | 2017-02-05 |    1117 | NULL    |
|      10101 | 2017-06-02 |    1119 | NULL    |
|      10057 | 2017-02-03 |    1123 | 1124    |
|      10107 | 2017-06-02 |    1124 | 1125    |
+------------+------------+---------+---------+

This result matches your desired result, except the last case 7.

My query doesn't attempt to follow the chain of linked customers of arbitrary length and is limited to processing one pair at a time. That's why the case 7 result is not one row. The query will always pick row/account with the very last date (10107) and it may also pick account(s) in the middle of the chain. In this case it picked a row 10057, not 10007, because this is a later account for customers 1122 and 1123.


When I looked at the execution plan I saw that the query behind CTE_LatestAccounts is run essentially four times.

It is likely that if you save result of CTE_LatestAccounts into a temp table with proper indexes the overall performance would be better.

Something like this:

DECLARE @LatestAccounts TABLE 
    (LatestAcctNumber int, LatestDT date, CustID int PRIMARY KEY);

WITH
CTE_Customers
AS
(
    SELECT
        AcctNumber
        ,dt
        ,Cust1ID AS CustID
    FROM @ACCT
    WHERE Cust1ID IS NOT NULL

    UNION ALL

    SELECT
        AcctNumber
        ,dt
        ,Cust2ID AS CustID
    FROM @ACCT
    WHERE Cust2ID IS NOT NULL
)
,CTE_RN
AS
(
    SELECT
        AcctNumber
        ,dt
        ,CustID
        ,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY dt DESC) AS rn
    FROM CTE_Customers
)
,CTE_LatestAccounts
-- this gives one row per CustID
AS
(
    SELECT
        AcctNumber AS LatestAcctNumber
        ,dt AS LatestDT
        ,CustID
    FROM CTE_RN
    WHERE rn = 1
)
INSERT INTO @LatestAccounts (LatestAcctNumber, LatestDT, CustID)
SELECT LatestAcctNumber, LatestDT, CustID
FROM CTE_LatestAccounts;


WITH
CTE_MaxLatestAccounts
AS
(
    SELECT
        A.Cust1ID
        ,A.Cust2ID
        ,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
        THEN A1.LatestDT ELSE A2.LatestDT END AS MaxLatestDT
        ,CASE WHEN ISNULL(A1.LatestDT, '2000-01-01') > ISNULL(A2.LatestDT, '2000-01-01')
        THEN A1.LatestAcctNumber ELSE A2.LatestAcctNumber END AS MaxLatestAcctNumber
    FROM
        @ACCT AS A
        LEFT JOIN @LatestAccounts AS A1 ON A1.CustID = A.Cust1ID
        LEFT JOIN @LatestAccounts AS A2 ON A2.CustID = A.Cust2ID
)
,CTE_CustomersWithLatestAccountFromPair
AS
(
    SELECT
        Cust1ID AS CustID
        ,MaxLatestDT
        ,MaxLatestAcctNumber
    FROM CTE_MaxLatestAccounts
    WHERE Cust1ID IS NOT NULL

    UNION ALL

    SELECT
        Cust2ID AS CustID
        ,MaxLatestDT
        ,MaxLatestAcctNumber
    FROM CTE_MaxLatestAccounts
    WHERE Cust2ID IS NOT NULL
)
,CTE_CustomersWithLatestAccountFromPairRN
AS
(
    SELECT
        CustID
        ,MaxLatestDT
        ,MaxLatestAcctNumber
        ,ROW_NUMBER() OVER (PARTITION BY CustID ORDER BY MaxLatestDT DESC) AS rn
    FROM CTE_CustomersWithLatestAccountFromPair
)
,CTE_FinalAccounts
AS
(
    SELECT MaxLatestAcctNumber
    FROM CTE_CustomersWithLatestAccountFromPairRN
    WHERE rn = 1
)
SELECT *
FROM @ACCT AS A
WHERE A.AcctNumber IN (SELECT MaxLatestAcctNumber FROM CTE_FinalAccounts)
;

If you really need to merge/group all linked customers into one row when the length of the chain is arbitrary, you can do it with recursive query like shown, for example, here: How to find all connected subgraphs of an undirected graph

Once you have tagged each customer with some GroupID, find the latest account for each individual customer as in the beginning of this query. Then find the latest account among the group (rather than for the simple pair as in this query).

The query that finds all subgraphs of an undirected graph in the linked question may be quite slow for a large dataset and there are efficient non-set based algorithms to do it.

If you know that the maximum length of the chain can't exceed some number, it is possible to make this recursive query more efficient.

Vladimir Baranov
  • 31,799
  • 5
  • 53
  • 90
  • this works to get the desired result +1, thank you. It is seven times slower than iSR5's. – Golden Ratio Jun 26 '18 at 19:44
  • @GoldenRatio, it is understandable. iSR5's query and my query may produce different results, it depends on the actual data in your table. If your data is such that both of our queries produce identical results, then iSR5's simpler version is the way to go. – Vladimir Baranov Jun 27 '18 at 00:00
  • if you can think of an example that your query works where iSR5's would not, I will definitely want to use yours over his. Accuracy is the more important factor for me in this scenario. – Golden Ratio Jun 28 '18 at 14:02
  • @GoldenRatio, my query and iSR5's query produce different results in Case 7, 8a, 8b. The queries use different logic. Most likely there are other cases, not listed in the question, where results will be different. It is totally up to you to decide which logic suits you. It doesn't make much sense to compare performance of these queries, because they produce different results. – Vladimir Baranov Jun 29 '18 at 00:30
  • I have found examples where yours is also not accurate for at least some Case 1 customers. The customer may have 10 or 15 accounts, and the query seems to be random in which it chooses for the final result. I'll try to look into it further. – Golden Ratio Jun 29 '18 at 14:49
2

To apply logic to each subset a good operator to use is the CROSS APPLY operator. This allows us to find the most recent account for each Customer Id.

Setup

DECLARE @Stage TABLE
(
    AcctNumber INT
    ,[Date] DATETIME
    ,Cust1Id INT
   ,Cust2Id INT
)

INSERT INTO @Stage (AcctNumber, [Date] ,Cust1Id ,Cust2Id)
VALUES
(10000,'2.1.16',1110,NULL)
,(10001,'2.1.16',1111,NULL)
,(10050,'2.1.17',1111,NULL)
,(10008,'2.1.16',1120,NULL)
,(10058,'2.3.17',1120,1121)
,(10002,'2.1.16',1112,NULL)
,(10052,'2.2.17',1113,1112)
,(10003,'2.2.16',1114,1115)
,(7060,'2.4.17',1115,1114)
,(10004,'2.2.16',1116,1117)
,(10067,'2.5.17',1117,NULL)
,(10005,'2.1.16',1118,NULL)
,(10054,'2.3.17',1118,1119)
,(10101,'6.2.17',1119,NULL)
,(10007,'2.1.16',1122,1123)
,(10057,'2.3.17',1123,1124)
,(10107,'6.2.17',1124,1125)

--Additional Cases to cover
,(50001, '2016-01-01', 2001, NULL)
,(50002, '2017-02-02', 2001, 2002)
,(50003, '2017-03-03', 2001, NULL)
,(50004, '2017-04-04', 2002, NULL)

,(50005, '2016-01-01', 2003, NULL)
,(50006, '2017-02-02', 2003, 2004)
,(50007, '2017-03-03', 2004, NULL)
,(50008, '2017-04-04', 2003, NULL)

Execution

Cross Apply

;WITH Results AS(
    SELECT DISTINCT S2.*
    FROM @Stage S1
    CROSS APPLY (
        SELECT TOP 1 S2.*
        FROM @Stage S2
        WHERE 
            (S1.Cust1Id = S2.Cust1Id
            OR S1.Cust1Id = S2.Cust2Id
            OR S1.Cust2Id = S2.Cust1Id
            OR S1.Cust2Id = S2.Cust2Id)
        ORDER BY S2.[Date] DESC
            ) S2
)
SELECT R1.*
FROM Results R1
    LEFT JOIN Results R2
        ON R1.Cust2Id = R2.Cust1Id
WHERE R1.[Date] > R2.[Date]
    OR R2.AcctNumber IS NULL

The CROSS APPLY operators walk the cases backward to apply the logic to each joint account case while ensuring the most recent account is carried over. This alone covers most of the cases. The only lingering cases are the ones with 3 accounts being shifted between 3 customers. The self join and WHERE clause in the final select cover these.

Results

+------------+------------+---------+---------+
| AcctNumber | Date       | Cust1Id | Cust2Id |
| 7060       | 2017-02-04 | 1115    | 1114    |
| 10000      | 2016-02-01 | 1110    | NULL    |
| 10050      | 2017-02-01 | 1111    | NULL    |
| 10052      | 2017-02-02 | 1113    | 1112    |
| 10058      | 2017-02-03 | 1120    | 1121    |
| 10067      | 2017-02-05 | 1117    | NULL    |
| 10101      | 2017-06-02 | 1119    | NULL    |
| 10107      | 2017-06-02 | 1124    | 1125    |
| 50003      | 2017-03-03 | 2001    | NULL    |
| 50004      | 2017-04-04 | 2002    | NULL    |
| 50007      | 2017-03-03 | 2004    | NULL    |
| 50008      | 2017-04-04 | 2003    | NULL    |
+------------+------------+---------+---------+
Matt Rowland
  • 4,575
  • 4
  • 25
  • 34
  • Should you add one more `OR S1.Cust2Id = S2.Cust2Id` to the filter? I'm not sure. – Vladimir Baranov Jun 22 '18 at 12:47
  • @VladimirBaranov Good catch! I have added that. It doesn't change anything in the cases from OP but it would catch other intended cases. Thanks! – Matt Rowland Jun 22 '18 at 13:05
  • Still, at least one case (that OP didn't list) doesn't work out as I'd expect it to. Try to add these rows to the sample data: `(50001, '2016-01-01', 2001, NULL), (50002, '2017-02-02', 2001, 2002), (50003, '2017-03-03', 2001, NULL), (50004, '2017-04-04', 2002, NULL), ` I'd expect that final result should have only one extra row `(50004, '2017-04-04', 2002, NULL)`, but your query returns two: `(50003, '2017-03-03', 2001, NULL), (50004, '2017-04-04', 2002, NULL),`. – Vladimir Baranov Jun 22 '18 at 13:10
  • @VladimirBaranov I added in another nested `CROSS APPLY` this walks the multiple account backward. This covers the new cases that you stated. I also added another case that had a different relation between `Cust1Id`, `Cust2Id`, and `Date`. – Matt Rowland Jun 22 '18 at 14:30
  • I like how your original idea and query looked really elegant. Unfortunately, it becomes not so pretty when all cases are taken care of. – Vladimir Baranov Jun 23 '18 at 01:25
  • @VladimirBaranov I have similar thoughts. However, complex issues require complex solutions. I appreciate the contributions as well. – Matt Rowland Jun 23 '18 at 01:41
  • @VladimirBaranov I was able to simplify the solution again. This was because of the OP's desired result being different than our assumptions. – Matt Rowland Jun 26 '18 at 14:13
  • Is there a way to do this without the temp table? I was attempting to test and queries with delete, drop & insert are automatically rejected. – Golden Ratio Jun 26 '18 at 14:33
  • @GoldenRatio Why are they rejected? Is it a permissions thing? – Matt Rowland Jun 26 '18 at 14:34
  • @GoldenRatio I updated my answer to remove the temp table. the `@Stage` is still present as it represents your actual database table. – Matt Rowland Jun 26 '18 at 15:20
  • @MattRowland with the updated version it attempts to run, but the query takes too long to run and times out at 10 minutes. – Golden Ratio Jun 26 '18 at 19:41
1

I'm sure there is a much easier approach, but this is what I've had in mind :

SELECT 
    a.acctnumber, 
    a.date, 
    a.Cust1ID, 
    a.Cust2ID 
FROM acct a
OUTER APPLY (
SELECT acctnumber
FROM (
SELECT *, 
    ROW_NUMBER() OVER(PARTITION BY acctnumber ORDER BY [date] DESC) AS ACC_RN,
    ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY [date] DESC) AS RN
FROM (
SELECT 
     a1.acctnumber,
     a1.[date],
     a1.Cust1ID AS CustomerID
FROM acct a1
UNION 
SELECT 
     a2.acctnumber,
     a2.[date],
     a2.Cust2ID
FROM acct a2
) D
) C
WHERE 
    RN = 1
AND CustomerID IS NOT NULL
AND ACC_RN = 2
) acc
WHERE a.acctnumber IN(acc.acctnumber)
iSR5
  • 3,274
  • 2
  • 14
  • 13
  • Same note as on the Matt Rowland's answer. If you add few more rows to the sample data as Matt did `,(50001, '2016-01-01', 2001, NULL) ,(50002, '2017-02-02', 2001, 2002) ,(50003, '2017-03-03', 2001, NULL) ,(50004, '2017-04-04', 2002, NULL) ,(50005, '2016-01-01', 2003, NULL) ,(50006, '2017-02-02', 2003, 2004) ,(50007, '2017-03-03', 2004, NULL) ,(50008, '2017-04-04', 2003, NULL)` you'll see that your query returns more rows than one would reasonably expect. – Vladimir Baranov Jun 23 '18 at 12:08
  • @VladimirBaranov could you explain the logic behind the additional cases ? the query will show (50003, 50004, 50007, 50008) based on the OP logic, and based on the additional cases, I saw that Cust1ID #2001 has opened an account, then an year later, gets a second account with another customer, then one month later, both of them have new separate accounts. Same thing happened with 2003 and 2004. is that the logic you're trying to cover ? – iSR5 Jun 23 '18 at 13:32
  • It is similar to case 6 in the question. Only OP can tell us what result he wants to see in this specific example, but based on my understanding I think that query should return only `50004` and `50008`. – Vladimir Baranov Jun 24 '18 at 07:38
  • @VladimirBaranov if I understood the cases correctly, the expected results would be (50003, 50004, 50007, 50008) because of the same logic that I've mentioned before. if you want to replicate case 6, then we'll need to delete either 50003 or 50004. same thing on 50007, and 50008. As case 6 in the OP example shows that 118 was having a separate account, then shared it with 119, then 118 closed account, and 119 got a separate account, and that's why 118 was dropped off the results. this is how I understood the OP logic, and compared with your cases. – iSR5 Jun 24 '18 at 08:37
  • It is better to ask OP. Hopefully he is reading the comments and can add clarifications to his question. My reasoning is this: Customers `2001` and `2002` have shared an account at some point in time. That's why the result should have the last account, which belongs to either `2001` or `2002` or both, which is `50004`. Same with the second example. Customers `2003` and `2004` have shared an account at some point in time. The latest account that is associated with either `2003` or `2004` is `50008`. – Vladimir Baranov Jun 24 '18 at 11:28
  • @VladimirBaranov I agree, as we might be overdoing it as well ;) .. I believe a bank database should have more detailed structure than this one. (Imagine accounts table without the status of its accounts (closed, active ..etc). it'll be a horrible experience haha ;) – iSR5 Jun 24 '18 at 12:38
  • @iSR5 the table has quite a bit more info, but all of that is irrelevant to the question so I left it out. I was actually trying to find the status of each customer group's most recent account, and that is why I started to work on this query to begin with. Your query works great +1, and I've still a few more to test, but it's the fastest so far. – Golden Ratio Jun 26 '18 at 19:38
  • @GoldenRatio glad that I've could help. – iSR5 Jun 26 '18 at 20:29
  • @GoldenRatio just for my curiosity, is there any issue with my answer ? as it covers all cases (including the additional cases). if there is something that not covered, please let me know. (my answer was accepted, then unaccepted suddenly?) – iSR5 Jun 29 '18 at 19:37
  • @iSR5 When I attempted to implement the solution after accepting the answer, the managers of the branches who are much more familiar with their own data bombarded me with examples of the query not working. Looking at these specific examples, I believe I found why it was not working for what I thought was case 5, and have edited the Update accordingly, with a case 5b. It is not always the second borrower that ends up with the most recent account. – Golden Ratio Jul 03 '18 at 20:27
  • @GoldenRatio I can understand your situation, your updated cases where clear except case 5b, i see the customer has two accounts, and sharing one of them, but you dropped it from your desired results, and I don't know why ? for the rest you can check this http://sqlfiddle.com/#!18/98e4e/1 – iSR5 Jul 04 '18 at 06:49
  • @iSR5 I must have missed adding that to the desired results upon the edit, I edited again to correct. My goal of Case 5b is to show if a customer group splits up, but the second customer never opens another account, I want the most recent account of only the first customer. – Golden Ratio Jul 05 '18 at 15:37
  • @GoldenRatio The query that I gave you in sqlfiddle, it'll cover all cases except case 5b, the issue here is that if you'll only depend on the provided columns, then you won't go further than that, so you need to review your schema to find a proper column that would cover case 5b and any future cases. I'm telling you this, because all cases were simple except case 5b, without knowing the actual status of the account, you'll always get incorrect results. – iSR5 Jul 05 '18 at 21:27
  • @GoldenRatio, this is one thought, another thought that what if a customer has more than 3 accounts ? how you can actually know if these are active or not ? so depending on date only won't give you much results. that's why I asked to review your database schema to find a proper way to filter records. – iSR5 Jul 05 '18 at 21:30
  • @iSR5, I do not have access to the schema, I've found the tables with a query that searches the database, the columns with select * and I have to guess at the data types. However, I do know and use the 'Account Status' column that describes the state of each account, but am unable to use it. – Golden Ratio Aug 07 '18 at 14:27
  • Unfortunately, this query is an important piece to a larger query and the purpose of that query needs this one to ignore account status. The larger query is looking for the most recent Closed account of only the customers who have not since opened another account. If this query makes any decisions based on status, I either get false positives or lose results. – Golden Ratio Aug 07 '18 at 14:33
1

Could you just use a left-join to join accounts with other "linked" accounts with potentially later dates, and then just filter out records where the "Later Account" table is not null? Something like this:

select ThisAccount.* 
from Accounts ThisAccount
left join Accounts LaterAccount on
    LaterAccount.AcctNumber <> ThisAccount.AcctNumber
    and LaterAccount.dt > ThisAccount.dt
    and
    (   LaterAccount.Cust1ID = ThisAccount.Cust1ID
        or LaterAccount.Cust2ID = ThisAccount.Cust1ID
        or LaterAccount.Cust1ID = ThisAccount.Cust2ID
        or LaterAccount.Cust2ID = ThisAccount.Cust2ID
    )
where LaterAccount.AcctNumber is null
order by ThisAccount.AcctNumber

This should return the results as expected:

AcctNo  Dt          Cust1   Cust2
7060    2017-02-04  1115    1114
10000   2016-02-01  1110    NULL
10050   2017-02-01  1111    NULL
10052   2017-02-02  1113    1112
10058   2017-02-03  1120    1121
10067   2017-02-05  1117    NULL
10101   2017-06-02  1119    NULL
10107   2017-06-02  1124    1125
50003   2017-03-03  2001    NULL
50004   2017-04-04  2002    NULL
50007   2017-03-03  2004    NULL
50008   2017-04-04  2003    NULL
Zack
  • 2,220
  • 1
  • 8
  • 12
0

My answer is wrong, sorry for posting prematurely. I'm working on a different idea, I'll be back shortly.


Original response:

Assuming your date format is MM.DD.YY, I've got the code as shown below. I don't understand why your desired result set doesn't include rows for CustID 1116 or 1118, but I do see how including them will duplicate 1117 and 1119 respectively, unless the source data are modified to remove these duplicate 1117 and 1119 values from the results. For now, I have this interim solution, pending your response.

declare @ACCT table (
  acctnumber int,
  date date,
  Cust1ID int,
  Cust2ID int
);

insert into @ACCT values (10000, '2016-02-01', 1110, null);
insert into @ACCT values (10001, '2016-02-01', 1111, null);
insert into @ACCT values (10050, '2017-02-01', 1111, null);
insert into @ACCT values (10008, '2016-02-01', 1120, null);
insert into @ACCT values (10058, '2017-02-03', 1120, 1121);
insert into @ACCT values (10002, '2016-02-01', 1112, null);
insert into @ACCT values (10052, '2017-02-02', 1113, 1112);
insert into @ACCT values (10003, '2016-02-02', 1114, 1115);
insert into @ACCT values (7060,  '2017-02-04', 1115, 1114);
insert into @ACCT values (10004, '2016-02-02', 1116, 1117);
insert into @ACCT values (10067, '2017-02-05', 1117, null);
insert into @ACCT values (10005, '2016-02-01', 1118, null);
insert into @ACCT values (10054, '2017-02-03', 1118, 1119);
insert into @ACCT values (10101, '2017-06-02', 1119, null);
insert into @ACCT values (10007, '2016-02-01', 1122, 1123);
insert into @ACCT values (10057, '2017-02-03', 1123, 1124);
insert into @ACCT values (10107, '2017-06-02', 1124, 1125);

with

OneCustId as (
select
  acctnumber,[date], Cust1ID as CustID
from
  @ACCT

union

select
  acctnumber, [date], Cust2ID
from
  @ACCT
),

SortedByLastUsage as (
select
  acctnumber, [date], CustID, row_number() over (partition by CustID order by [date] desc) as RowID
from
  OneCustId
),

LastUsage as (
select
  acctnumber, [date], CustID
from
  SortedByLastUsage
where
  RowID = 1
)

select distinct
  ACCT.acctnumber, ACCT.[date], ACCT.Cust1ID, ACCT.Cust2ID
from
  @ACCT ACCT
  inner join LastUsage on
    ACCT.acctnumber = LastUsage.acctnumber and
    ACCT.[date] = LastUsage.[date] and
    LastUsage.CustID in (ACCT.Cust1ID, ACCT.Cust2ID)
order by
  Cust1ID, Cust2ID

The result set:

acctnumber  date    Cust1ID Cust2ID
10000   2016-02-01  1110    NULL
10050   2017-02-01  1111    NULL
10052   2017-02-02  1113    1112
7060    2017-02-04  1115    1114
10004   2016-02-02  1116    1117
10067   2017-02-05  1117    NULL
10054   2017-02-03  1118    1119
10101   2017-06-02  1119    NULL
10058   2017-02-03  1120    1121
10007   2016-02-01  1122    1123
10057   2017-02-03  1123    1124
10107   2017-06-02  1124    1125
Jeff Breadner
  • 1,366
  • 9
  • 19
0

I'm leaving my original answer in place, because the approach might work for someone else searching for this down the line.

I can't figure out how to do this without a cursor. As such, any other answer that provides the right answer (that doesn't use a cursor) is going to outperform this one. I'm not smart enough to figure out what that looks like, but it would have to include a nasty recursive CTE.

The real trick is getting all accounts that were ever related to each other grouped together. That is done in the big cursored if/then/else chain at the top, which could be cleaned up a bit. I've left my debug print statements in place, they can obviously be removed.

You could also make the Associations table permanent, instead of using a table variable.

Again, performance-wise, this is going to be really, really bad, but it does work. I'm looking forward to seeing what others come up with. Thanks for the high-quality question, too, that made life a lot easier.

The code:

declare @Associations table (
  GroupID int,
  CustID int
);

declare @NextGroupID int = 0;
declare @FoundGroup1ID int;
declare @FoundGroup2ID int;
declare @Cust1 int;
declare @Cust2 int;

declare db_cursor cursor for
select Cust1ID, Cust2ID from @ACCT;

open db_cursor;
fetch next from db_cursor into @Cust1, @Cust2;

while @@fetch_status = 0
begin
  set @FoundGroup1ID = null;
  set @FoundGroup2ID = null;
  print '----------------------------'
  print 'Cust1 = ' + isnull(cast(@Cust1 as varchar(max)), 'NULL')
  print 'Cust2 = ' + isnull(cast(@Cust2 as varchar(max)), 'NULL')

  select @FoundGroup1ID = GroupID from @Associations where CustID = @Cust1
  print 'FoundGroup1ID = ' + isnull(cast(@FoundGroup1ID as varchar(max)), 'NULL')

  if @Cust2 is null
  begin
    if @FoundGroup1ID is null 
    begin
      set @NextGroupID = @NextGroupID +1
      print 'Adding Cust1 to new group ' + cast(@NextGroupID as varchar(max))
      insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust1)
    end
  end 
  else -- @Cust2 is not null
  begin

    print 'FoundGroup2ID = ' + isnull(cast(@FoundGroup2ID as varchar(max)), 'NULL')
    select @FoundGroup2ID = GroupID from @Associations where CustID = @Cust2

    if @FoundGroup1ID is null and @FoundGroup2ID is null
    begin
      set @NextGroupID = @NextGroupID +1
      print 'Adding both to new group ' + cast(@NextGroupID as varchar(max))
      insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust1)
      insert into @Associations (GroupID, CustID) values (@NextGroupID, @Cust2)
    end 
    else if @FoundGroup1ID is not null and @FoundGroup2ID is null
    begin
      print 'Adding Cust2 to existing group ' + cast(@FoundGroup1ID as varchar(max))
      insert into @Associations (GroupID, CustID) values (@FoundGroup1ID, @Cust2)
    end
    else if @FoundGroup1ID is null and @FoundGroup2ID is not null
    begin
      print 'Adding Cust1 to existing group ' + cast(@FoundGroup2ID as varchar(max))
      insert into @Associations (GroupID, CustID) values (@FoundGroup2ID, @Cust1)
    end
    else -- Neither is null
    begin
      print 'Switching all of GroupID ' + cast(@FoundGroup2ID as varchar(max)) + ' to GroupID ' + cast(@FoundGroup1ID as varchar(max))
      update @Associations set GroupID = @FoundGroup1ID where GroupID = @FoundGroup2ID
    end
  end
  fetch next from db_cursor into @Cust1, @Cust2;
end
close db_cursor;
deallocate db_cursor;

;with

AddedGroupID as (
select
  ACCT.acctnumber,
  ACCT.[date],
  ACCT.Cust1ID,
  ACCT.Cust2ID,
  Associations.GroupID,
  row_number() over (partition by Associations.GroupID order by ACCT.[date] desc) as RowID
from
  @ACCT ACCT
  inner join @Associations Associations on
    Associations.CustID in (ACCT.Cust1ID, ACCT.Cust2ID)
)

select 
  acctnumber, [date], Cust1ID, Cust2ID
from 
  AddedGroupID
where
  RowID = 1

The results:

acctnumber  date    Cust1ID Cust2ID
10000   2016-02-01  1110    NULL
10050   2017-02-01  1111    NULL
10058   2017-02-03  1120    1121
10052   2017-02-02  1113    1112
7060    2017-02-04  1115    1114
10067   2017-02-05  1117    NULL
10101   2017-06-02  1119    NULL
10107   2017-06-02  1124    1125
Jeff Breadner
  • 1,366
  • 9
  • 19
  • I am on the consumer side of this database, so it is read only and any query that uses the keywords insert, create, or update are automatically rejected. – Golden Ratio Jun 26 '18 at 19:59
  • Even with a table variable? What if you were to use a different SQL machine and access your main data via a linked server? There are ways around it. But, I can understand wanting to prioritize a more simplistic solution. Unfortunately, the solution I present here does require the ability to insert in some form; if this isn't allowed, then you'll need to pick a different solution. – Jeff Breadner Jun 27 '18 at 00:33
  • Our only access to the database is through a Citrix Portal, then within their software, a text box interface that queries have to be pasted into – Golden Ratio Jun 27 '18 at 13:55
0

we should not worry about using EXISTS as it operate fast in such case and i suppose is simplest possible solution:

SELECT
    A.ACCTNUMBER, A.DT as "date", A.CUST1ID, A.CUST2ID
FROM
    ACCT A
WHERE
        NOT EXISTS
        (SELECT
            *
        FROM
            ACCT A2
        WHERE
            (A2.CUST1ID = A.CUST1ID
            OR A2.CUST2ID = A.CUST1ID
            OR (A.CUST2ID IS NOT NULL AND A2.CUST1ID = A.CUST2ID)
            OR (A.CUST2ID IS NOT NULL AND A2.CUST2ID = A.CUST2ID)
            )
            AND A2.DT>A.DT
        )

i have assumed that you have separate indexes on CUST1ID and another on CUST2ID. You can compare result without ascending index on DT ("date") field and with it. It can speed up your query or slow down - i do not know how your real data looks like

Livius
  • 958
  • 1
  • 6
  • 19
0

Try below query. It lengthy, because there's need to apply repeatedly windowed functions (you can't nest them in a single query), but the query itself is pretty simple. The core idea is to split customers that never share account from customers that do. After that, for the single-account customer, the grouping column is easy, it's Cust1ID, but for other, you have to do some operations described below, to get grouping column:

To obtain grouping column (for multiple-account cutomers), you have to apply following logic:

Put all 1st customers together with 2nd customers in same column using UNION ALL (CTE called cte in query). Then, when you sort by that column, and checking both IDs with IDs of the following row, you can check if they are "connected", i.e. they have at least one ID the same:

case when Cust1ID in (cust1idLead, cust2idLead) or Cust2ID in (cust1idLead, cust2idLead) then 1 else 0 end SameGroup

This way you can distinct groups and within that groups take maximum respectively to the date (dt column).

Sample data:

declare @tbl table (acctnumber int, dt date ,   Cust1ID int,    Cust2ID  int);
insert into @tbl values
(10000, '2.1.16', 1110, null),
(10001, '2.1.16', 1111, null),
(10050, '2.1.17', 1111, null),
(10008, '2.1.16', 1120, null),
(10058, '2.3.17', 1120, 1121),
(10002, '2.1.16', 1112, null),
(10052, '2.2.17', 1113, 1112),
(10003, '2.2.16', 1114, 1115),
(7060, '2.4.17', 1115, 1114),
(10004, '2.2.16', 1116, 1117),
(10067, '2.5.17', 1117, null),
(10005, '2.1.16', 1118, null),
(10054, '2.3.17', 1118, 1119),
(10101, '6.2.17', 1119, null),
(10007, '2.1.16', 1122, 1123),
(10057, '2.3.17', 1123, 1124),
(10107, '6.2.17', 1124, 1125)

T-SQL:

;with SingleAccounts as (
    select cust1id from @tbl
    where Cust2ID is null
    except
    select cust1id from @tbl
    where Cust2ID is not null
    except
    select cust2id from @tbl
), cte as (
    select  acctnumber, dt, Cust1ID, Cust2ID from @tbl
    where Cust1ID not in (select Cust1ID from SingleAccounts)
    union all
    select  acctnumber, dt, Cust2ID, Cust1ID from @tbl
    where Cust1ID not in (select Cust1ID from SingleAccounts) and Cust2ID is not null
), SingleAmountsResult as (
    select acctnumber, dt, cust1id, cust2id,
           ROW_NUMBER() over (partition by cust1id order by dt desc) rn 
    from @tbl 
    where cust1id in (select Cust1ID from SingleAccounts)
), FinalResult as (
    select acctnumber, dt, cust1id, cust2id from SingleAmountsResult
    where rn = 1
    union all
    select acctnumber, dt, cust1id, cust2id
    from (
        select acctnumber, dt, cust1id, cust2id,
               ROW_NUMBER() over (partition by GroupingColumn order by dt desc) rn
        from (
            select acctnumber, dt, cust1id, cust2id,
                   SUM(NewGroup) over (order by cust1id, cust2id) GroupingColumn
            from (
                select acctnumber, dt, cust1id, cust2id,
                       case when LAG(SameGroup) over (order by cust1id, cust2id) = 0 then 1 else 0 end NewGroup
                from (
                    select acctnumber, dt, cust1id, cust2id,
                           case when Cust1ID in (cust1idLead, cust2idLead) or Cust2ID in (cust1idLead, cust2idLead) then 1 else 0 end SameGroup
                    from (
                        select acctnumber, dt, cust1id, cust2id,
                               LEAD(cust1id) over (order by cust1id, cust2id) cust1idLead,
                               LEAD(cust2id) over (order by cust1id, cust2id) cust2idLead
                        from cte
                    ) a 
                ) a
            ) a 
        ) a 
    ) a where rn = 1
)

--this final query gets you correct Cust1ID and Cust2ID, as FinalResult might have them switched
select * from @tbl
intersect
select * from (
    select acctnumber, dt, cust1id, cust2id from FinalResult
    union all
    select acctnumber, dt, cust2id, cust1id from FinalResult
) fr

UPDATE

This code, accordingly to OP explanation, treat all customers ID that have ever been on one account together as same group (and this is transitive1)), thus, for additional cases, 8a and 8b the results are:

acctnumber | dt         | Cust1ID | Cust2ID
50004      | 2017-04-04 | 2002    | NULL
50008      | 2017-04-04 | 2003    | NULL 

as there will be only 2 groups!

1) it means, if element a is in group with element b and b is in the same group as element c then it implies that a and c are also in the same group.

Michał Turczyn
  • 32,028
  • 14
  • 47
  • 69
  • the query processor ran out of internal resources and could not produce a query plan, so I was unable to appropriately test this query, but as it, it takes too long to run. – Golden Ratio Jun 26 '18 at 19:57
0

This is quite complex...

First you want to identify groups of customers. That is all customers who were directly or indirectly related. With customer pairs A/B, B/C, D/E, D/F, G/A, H/A, H/F you'd have just one single group for instance. In SQL this requires a recursive query.

SQL Server lacks a cycle detection in recursive queries. So from customers A/B you'd get to all pairs containing A or B, which is B/C, A/B G/A, H/A, and A/B itself for that matter. Even, if we detect this direct circle (same pair), we'd go on with B/C looking for all records that contain B or C. And one of these is A/B again and once more we are in a cycle. One way to deal with this is to build a string of yet visited customers and not visit them again.

Our result is all cutomers with all directly or indirectly connected other customers. Using aggregation, we can take the minimum partner per customer and use this as a group key. In above example all customers are related to A, so A is all their minimum partner, showing that all belong to the same group. If we add two records X/Y and Z/-, then we have two more groups: X and Y belonging to the X group, and Z belonging to the Z group.

These groups we use to look up our original records again. With ROW_NUMBER we number each group's last record with #1. Then we keep only those and we are done.

with all_cust(custid) as
(
  select cust1id from mytable
  union
  select cust2id from mytable where cust2id is not null
)
, cte(c1, c2, sofar) as
(
  select custid, custid, '<' + cast(custid as varchar(max)) + '>' from all_cust
  union all
  select cte.c1, case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end,
     cte.sofar + '<' + cast(case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end as varchar(max)) + '>'
  from mytable m
  join cte on cte.c2 in (m.cust1id, m.cust2id)
  and cte.sofar not like '%' + cast(case when cte.c2 = m.cust1id then m.cust2id else m.cust1id end as varchar(max)) + '%'
)
, groups(custid, grp) as
(
  select c1, min(c2) from cte group by c1
)
, ranked as
(
  select *, row_number() over (partition by g.grp order by date desc) as rn 
  from groups g
  join mytable m on g.custid in (m.cust1id, m.cust2id)
)
select acctnumber, date, cust1id, cust2id
from ranked
where rn = 1
order by cust1id;

Rextester demo: http://rextester.com/RWCQ83881

Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73