Using R for extracting rows from a large dataset which meet multiple conditions for each vendor_id

Question

Using R Programming. I have dataset with Vendor_id, Bank_account_no and Date with over 3 million. I want to get the rows for each vendor_id where the Bank_account_no changes, for example from X to X to X (at least three time, could be more than three) to Y (just once) to X again within a three months period. The dataset is such that the changes are all random so the window is not fixed with number of rows for each vendor_id. I using the rle function to get length for different Bank_account_no. Not sure how can I create a logic in R for these many rows considering I want to run this logic for each vendor_id. May be data.table can help in this. The input is as follows:

Vendor_ID   Bank_account_no   Date    

<!-- -->
dddd       X              24-12-2018
dddd       X              24-12-2018
dddd       X              26-12-2018
dddd       Y              27-12-2018
dddd       X              28-12-2018
dddd       X              29-12-2018
dddd       X              29-12-2018
dddd       X              31-12-2018
dddd       X              24-01-2019
dddd       Z              25-01-2019
dddd       X              28-01-2019
dddd       G              28-01-2019
dddd       G              28-01-2019
eeee       A              30-01-2019
eeee       A              31-01-2019
eeee       A              31-01-2019    
eeee       B              31-01-2019
eeee       A              31-01-2019

The output should be:

Vendor_ID   Bank_account_no   Date    Case

<!-- -->
dddd       X              24-12-2018  Case1
dddd       X              24-12-2018  Case1
dddd       X              26-12-2018  Case1
dddd       Y              27-12-2018  Case1
dddd       X              28-12-2018  Case1
dddd       X              29-12-2018  Case2
dddd       X              29-12-2018  Case2
dddd       X              31-12-2018  Case2
dddd       X              24-01-2019  Case2
dddd       Z              25-01-2019  Case2
dddd       X              28-01-2019  Case2
eeee       A              30-01-2019  Case3
eeee       A              31-01-2019  Case3
eeee       A              31-01-2019  Case3    
eeee       B              31-01-2019  Case3
eeee       A              31-01-2019  Case3

It looks like you will need [streak_run()](https://gogonzo.github.io/runner/reference/streak_run.html) function from [runner](https://gogonzo.github.io/runner/index.html) package. However, I don't understand exectly what you need to achieve, so I didn't prepared ready solution. — GoGonzo, Nov 12 '19 at 07:55

chinsoon12 · Answer 1 · 2019-11-15T12:46:08.967

0

There might be edge cases not covered but this is an option to start you off:

#get run length encoding for ID
DT[, r := rleid(ID)]

#identify locations of desired changes in ID
DT[, w := ID!=shift(ID, -1L) & ID==shift(ID, -2L)]

#set to TRUE at the start of a sequence
DT[, sw := replace(shift(w, 3L, fill=FALSE), 1L, TRUE), Vendor_ID]

#filter out those not part of any sequence
ans <- DT[-DT[, {
        ix <- which(w)
        if (length(ix) > 0L) .I[!r %in% r[sapply(ix, `+`, 0L:2L)]]
    }, Vendor_ID]$V1]

#create the desired Case column
ans[, Case := paste0("Case", cumsum(sw))]

output:

    Vendor_ID ID       Date r     w    sw  Case
 1:      dddd  X 24-12-2018 1 FALSE  TRUE Case1
 2:      dddd  X 24-12-2018 1 FALSE FALSE Case1
 3:      dddd  X 26-12-2018 1  TRUE FALSE Case1
 4:      dddd  Y 27-12-2018 2 FALSE FALSE Case1
 5:      dddd  X 28-12-2018 3 FALSE FALSE Case1
 6:      dddd  X 29-12-2018 3 FALSE  TRUE Case2
 7:      dddd  X 29-12-2018 3 FALSE FALSE Case2
 8:      dddd  X 31-12-2018 3 FALSE FALSE Case2
 9:      dddd  X 24-01-2019 3  TRUE FALSE Case2
10:      dddd  Z 25-01-2019 4 FALSE FALSE Case2
11:      dddd  X 28-01-2019 5 FALSE FALSE Case2
12:      eeee  A 30-01-2019 7 FALSE  TRUE Case3
13:      eeee  A 31-01-2019 7 FALSE FALSE Case3
14:      eeee  A 31-01-2019 7  TRUE FALSE Case3
15:      eeee  B 31-01-2019 8    NA FALSE Case3
16:      eeee  A 31-01-2019 9    NA FALSE Case3

data:

library(data.table)
DT <- fread("Vendor_ID   ID   Date    
dddd       X              24-12-2018
dddd       X              24-12-2018
dddd       X              26-12-2018
dddd       Y              27-12-2018
dddd       X              28-12-2018
dddd       X              29-12-2018
dddd       X              29-12-2018
dddd       X              31-12-2018
dddd       X              24-01-2019
dddd       Z              25-01-2019
dddd       X              28-01-2019
dddd       G              28-01-2019
dddd       G              28-01-2019
eeee       A              30-01-2019
eeee       A              31-01-2019
eeee       A              31-01-2019    
eeee       B              31-01-2019
eeee       A              31-01-2019")

edited Nov 15 '19 at 12:46

answered Nov 12 '19 at 00:25

chinsoon12

25,005
4
25
35

Thanks for your revert. I tried using this code but run into an error: Error in shift(ID, -1L) : n must be non-negative integer values (>= 0) – Sara Nov 13 '19 at 11:26
you need data.table version >= 1.12.0 – chinsoon12 Nov 13 '19 at 11:50
Is there anything equivalent for shift() in data.table 1.10.4 version? – Sara Nov 13 '19 at 15:55
I replaced that line of code with this: DT[, w := ID!=shift(ID, 1L,type="lag") & ID==shift(ID, 2L,type="lag")] Is this equivalent to what you wrote in your code? – Sara Nov 13 '19 at 16:54
When I run this code with my dataset, it throws error : Error in r[sapply(ix, `+`, 0L:2L)] : invalid subscript type 'list' . Can you help with this? – Sara Nov 15 '19 at 11:43
Can you add a print(ix) just before this line? – chinsoon12 Nov 15 '19 at 11:53
It gives integer(0) as the result – Sara Nov 15 '19 at 14:15
Ok I added something yesterday to handle cases like this. Can you try again? – chinsoon12 Nov 15 '19 at 22:51
The code runs but is not showing the correct result: shows "ans" with 0 rows because the length(ix)=0 in the logic. – Sara Nov 18 '19 at 09:47
can you share some data to replicate that error? hard to debug with any data – chinsoon12 Nov 18 '19 at 09:48

Using R for extracting rows from a large dataset which meet multiple conditions for each vendor_id

1 Answers1