2

I've customers transaction data of around 20 million records for a month. For a campaign I need to get customers' qualification for reward on following patterns:

  1. Customer will qualify for each transaction he would perform after 7 days.
  2. Any transaction performed within 7 days window will be ignored

For Example: 1) If User-A has done transaction on 3rd, 4th, 6th, 7th, 9th, 11th, 28th - he will be rewarded for 3rd, 9th & 28th dated transactions and all transactions in between will be ignored. 2) If User-B has done transaction on 1st, 4th, 11th, 17th, 21st, 30th - he will be rewarded for 1st, 11th, 17th & 30th dated transactions and all transactions in between will be ignored. 3) if User-C has done transaction on 1st and 30th - he will be rewarded for both transactions.

I've spent 3 days and tried to do it many ways, but due to my limited knowledge I could not succeeded.

I've tried to do it through loop query which was giving required results but processing 20 million records through loop is taking hell of time.

Please help me with any efficient solution to get this done. I'll be really really grateful for this.

This picture shows the sample transaction data of customers

This image is to help understand the problem statement

Following is simple query statement which is just good for nothing but I've tried:

SELECT  t1.[FINANCIAL ID], 
        t1.MSISDN, 
        t1.[DATE],
        MIN(t2.[DATE]) AS [NEXT DATE],
        ISNULL(DATEDIFF(DAY, t1.[DATE], MIN(t2.[DATE])), 0) AS DAYSDIFF1
FROM    mydb.dbo.RequiredTrxnForCampaign t1
        LEFT JOIN mydb.dbo.RequiredTrxnForCampaign t2
            ON t1.MSISDN = t2.MSISDN
            AND t2.[DATE] > t1.[DATE]
GROUP BY t1.[FINANCIAL ID], t1.MSISDN, t1.[DATE]

Following is Loop query I've tried but it is taking 40 minutes for 100K records with all possible optimizations I could have done.

DECLARE @minid int = (SELECT MIN(rownumber) FROM mydb.dbo.Test_5k t)
DECLARE @maxid int = (SELECT MAX(rownumber) FROM mydb.dbo.Test_5k t)

DECLARE @fid varchar(11) = NULL
DECLARE @msisdn varchar(20) = NULL
DECLARE @date datetime = NULL
DECLARE @product varchar(50) = NULL
DECLARE @checkmsisdn smallint = NULL
DECLARE @checkdate datetime = NULL
DECLARE @datediff int = NULL

TRUNCATE TABLE mydb.dbo.MinDateTable
TRUNCATE TABLE mydb.dbo.Test_5k_Result


WHILE (@minid <= @maxid)
BEGIN

SET @fid =          (SELECT tk.[FINANCIAL ID] FROM dbo.Test_5k tk WHERE tk.rownumber = @minid)
SET @msisdn =       (SELECT tk.MSISDN FROM dbo.Test_5k tk WHERE tk.rownumber = @minid)
SET @date =         (SELECT tk.[DATE] FROM dbo.Test_5k tk WHERE tk.rownumber = @minid)
SET @product =      (SELECT tk.[PRODUCT NAME] FROM dbo.Test_5k tk WHERE tk.rownumber = @minid)
SET @checkmsisdn =  (SELECT count(*) FROM dbo.MinDateTable mdt WHERE mdt.MSISDN=@msisdn)
SET @checkdate =    (SELECT mdt.[MIN DATE] FROM dbo.MinDateTable mdt WHERE mdt.MSISDN=@msisdn)
SET @datediff =     (ISNULL(DATEDIFF(DAY, @checkdate, @date), 0))


IF (@checkmsisdn = 0)
BEGIN
    INSERT INTO dbo.MinDateTable (MSISDN, [MIN DATE])
    VALUES (@msisdn, @date);

    INSERT INTO dbo.Test_5k_Result (MSISDN, [DATE], [PRODUCT NAME], [FINANCIAL ID], DAYSDIFF)
    VALUES (@msisdn, @date, @product, @fid, @datediff);
END
ELSE
BEGIN
    IF (@checkmsisdn > 0 AND @datediff >= 6)
    BEGIN
        UPDATE dbo.MinDateTable
        SET [MIN DATE] = @date
        WHERE MSISDN=@msisdn

        INSERT INTO dbo.Test_5k_Result (MSISDN, [DATE], [PRODUCT NAME], [FINANCIAL ID], DAYSDIFF)
        VALUES (@msisdn, @date, @product, @fid, @datediff);
    END
END

SET @minid = @minid + 1
END;

Required result are to get all those transactions out of 20 million where customer will be rewarded as per above details.

GMB
  • 216,147
  • 25
  • 84
  • 135
  • Q1. Is the 7 day window including the endpoint? Q2. Is the 7-day window fixed or rolling? If it is fixed (day 1-7, 8-14, 15-21,...) then I don't understand case XXXXXX807 in your attached picture (day10 should be rewarded); If it is rolling from the first unrewarded transaction performed, then I don't understand why the 9th transaction get rewarded (should be 11th) in your 1st example. PLEASE CLARIFY. – Bill Huang Oct 05 '19 at 18:46
  • A1. It is 7 day windows excluding endpoint. So, if customer is rewarded on 1st his next reward day will be 7th or any day beyond 7th (if he do not perform transaction on 7th) A2. It is on rolling window basis. If customer get rewarded on 1st on performing transaction, he will not get any reward for next days before 7th even if he performs transaction and than get rewarded for any transaction done on 7th or later and from next transaction date his next 7 days cycle will be start and so on. – Anwaar E Mustafa Oct 06 '19 at 00:40
  • Are you sure "excluding the endpoint" is correct in your business logic? In this way a customer doing transaction every day will be awarded every 6 days: 123456|123456|123456|.... . This contradicts real life experience where there are 7 days per week. If what you stated is indeed true, then your time window is actually not 7 days but 6 days instead because that is the real resetting period. In this case I will re-run my code using this new parameter. – Bill Huang Oct 06 '19 at 01:24

3 Answers3

2

You can easily implement arbitrary aggregation logic using an updatable cursor. When I fail to find appropriate advanced SQL functions to tackle my problem, this is usually the ultimate killer I fall back to.

A potential advantage of using a cursor to process large datasets is that it can avoid costly join operations and hence minimizing data I/O.

The solution is done with only 2 data passes. The first pass is creating a separate answer dataset, which is usually necessary in a real business use to protect the original dataset. The second pass is to compute reward or not row-by-row. Hence, for a dataset of 20M records, it should be more efficient than any solutions involving joins.

You may also see my answer on another question, which is basically a simplified variant of your question.

Tested on sql server 2017 latest (linux docker image)

Test Dataset

use [testdb];
if OBJECT_ID('testdb..test') is not null
    drop table testdb..test;

create table test (
    MSISDN varchar(50),
    [date] datetime
);
GO

-- name list, need not be sorted
insert into test(MSISDN, [date]) 
values ('1', '2019-01-01'),
       ('1', '2019-01-06'),
       ('1', '2019-01-07'),
       ('1', '2019-01-08'),
       ('1', '2019-01-12'),
       ('1', '2019-01-17'),
       ('1', '2019-01-19'),
       ('1', '2019-01-22'),
       ('2', '2019-01-05'),
       ('2', '2019-01-09'),
       ('2', '2019-01-11'),
       ('2', '2019-01-12'),
       ('2', '2019-01-20'),
       ('2', '2019-01-31');

declare @reward_window int = 7;  -- D = last reward day
                                 -- Transaction on D, D+1, ... D+6 -> no reward
                                 -- First transaction on and after D+7 -> rewarded

Solution

/* Setup */

-- Create answer dataset
if OBJECT_ID('tempdb..#ans') is not NULL
    drop table #ans;

select 
    -- Create a unique key to enable cursor update
    -- A pre-existing unique index can also be used
    row_number() over(order by MSISDN, [date]) as rn,

    MSISDN,
    -- Date part only. Or just [date] to include the time part
    CONVERT(date, [date]) as [date],  
    -- differnce between this and previous transactions from the same customer
    datediff(day, 
             LAG([date], 1, '1970-01-01') over(partition by [MSISDN] 
                                               order by [date]),
             [date]
            ) as diff_days,
    -- no reward by default
    0 as reward 
into #ans
from test
order by MSISDN, [date];

create unique index idx_rn on #ans(rn);

-- check
-- select * from #ans;

-- cursor for iteration
declare cur cursor local
for select rn, MSISDN, [date], diff_days, reward 
    from #ans 
    order by MSISDN, [date]
for update of [reward];
open cur;

-- fetched variables
declare @rn int,
        @MSISDN varchar(50), 
        @DT datetime, 
        @diff_days int,
        @reward int;

-- State from previous row
declare @MSISDN_prev varchar(50) = '', 
        @DT_prev datetime = '1970-01-01', 
        @days_to_last_reward int = 0;

/* Main loop */
while 1=1 begin

    -- read next line and check termination condition
    fetch next from cur
        into @rn, @MSISDN, @DT, @diff_days, @reward;

    if @@FETCH_STATUS <> 0
        break;

    /* Main logic here **/
    -- accumulate days_to_last_reward
    set @days_to_last_reward += @diff_days;

    -- Reward for new customer or days_to_last_reward >= @reward_window)
    if @MSISDN <> @MSISDN_prev or @days_to_last_reward >= @reward_window begin
        update #ans
            set reward = 1
            where current of cur;
        -- reset days_to_last_reward
        set @days_to_last_reward = 0;
    end

    -- setup next round
    set @MSISDN_prev = @MSISDN;
    set @DT_prev = @DT;
end

-- cleanup
close cur;
deallocate cur;

-- show
select * -- MSISDN, [date], reward 
from #ans 
order by MSISDN, [date];

Output

This should make sense provided that a customer rewarded on Jan. 1 can be rewarded again on Jan. 8.

| rn | MSISDN | date       | diff_days | reward |
|----|--------|------------|-----------|--------|
| 1  | 1      | 2019-01-01 | 17897     | 1      |
| 2  | 1      | 2019-01-06 | 5         | 0      |
| 3  | 1      | 2019-01-07 | 1         | 0      |
| 4  | 1      | 2019-01-08 | 1         | 1      |
| 5  | 1      | 2019-01-12 | 4         | 0      |
| 6  | 1      | 2019-01-17 | 5         | 1      |
| 7  | 1      | 2019-01-19 | 2         | 0      |
| 8  | 1      | 2019-01-22 | 3         | 0      |
| 9  | 2      | 2019-01-05 | 17901     | 1      |
| 10 | 2      | 2019-01-09 | 4         | 0      |
| 11 | 2      | 2019-01-11 | 2         | 0      |
| 12 | 2      | 2019-01-12 | 1         | 1      |
| 13 | 2      | 2019-01-20 | 8         | 1      |
| 14 | 2      | 2019-01-31 | 11        | 1      |
Bill Huang
  • 4,491
  • 2
  • 13
  • 31
1

You can do this using recursive CTEs . . . which may not be so bad for only a small amount of data for each customer:

with cte as (
      select msisdn, date
      from (select t.*,
                   row_number() over (partition by msisdn order by date) as seqnum
            from RequiredTrxnForCampaign t
           ) t
      where seqnum = 1
      union all
      select t.msisdn, t.date
      from cte cross apply
           (select top (1) t.*
            from RequiredTrxnForCampaign t
            where t.msisdn = cte.msisdn and
                  t.date >= dateadd(day, 7, cte.date)
            order by t.date asc
           ) t
    )
select msisdn, date
from cte
order by msisdn, date;

Do not attempt this without an index on (msisdn, date).

You can then apply the filtering logic for a specific period of time. I would recommend filtering in the first part of the CTE.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
0

Use integer arithmetic to group rows and find min day number whithin a group. Demo

create table foo (
    id int,
    customer varchar(10),
    dayn int
);

insert foo values
     ( 1,'A', 3) 
    ,( 2,'A', 4)
    ,( 3,'A', 6)
    ,( 4,'A', 7)
    ,( 5,'A', 9)
    ,( 6,'A',11)
    ,( 7,'A',28) 
    ,( 8,'B', 1)
    ,( 9,'B', 4)
    ,(10,'B',11)
    ,(11,'B',17)
    ,(12,'B',21)
    ,(13,'B',30);

select top(1) with ties id, customer, dayn
from foo 
order by row_number() over(partition by customer, (dayn - 1) / 7 order by dayn);
Serg
  • 22,285
  • 5
  • 21
  • 48