0

I have the following issue and I hope you may help me: I have a huge database (which I cannot disclose) but is it is structured as follows:

  • 5 million observations
  • 7 variables of which three of interest in this case:
  • Code ID Buy
  • Code ID Sell
  • Date

I would like another variable called new, which takes the value 0 in line i if:

  • there exists an observation k, with Code_IB_Buy_[i]=Code_IB_Buy_[k] and Code_IB_Sell_[i]=Code_IB_Sell_[k] and Date[i] is after Date[k] if not, I would like new[i]=1.

Basically if there was no transaction before between the buyer and the seller, this is the first occurrence in line i so the variable new takes the value 1.

What I have tried until now, is to sort the database (as a data.table) by date and by buyer. Then I use a For Loop, which would work in other languages but here it does not for some reason.

EDIT: A reduced sample would look like this:

    library(data.table)
set.seed(1)
Data <- data.frame(
  Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
  Amount <- rnorm(21,mean=100,sd=20),
  Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
  Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A"),
  Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,0,1,2),
LastRelationshipseller = c("98C","98C","98C","98C","98C","98C","98C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
    )

Data$new<-0

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
  r <- rowid(Code_ID_Buy, Code_ID_Sell)
  +(r==1L)
}]

Data[Month==1L, new:=0L]

    View(Data)

The data is already ordered since I used (and it works fine) : Data<-Data[order(month),] Data<-Data[order(Code_ID_Buy),]

However, what I would like to do now is to introduce the new column "new" that takes value 1 if and if only R does not find any observation where ID_Buy and ID_Sell are the same before the given date (new customer relationship)

In the example above, line 7,14 and 19 should be marked with 1 but none of the others.

In a second step, I would like to calculate the number of months between the first and second relationship (a column "distance", which would take the value "month of new relationship"-"month of last new relationship" else, that is in line 14, it would take the value "2",5-3, but this is getting ahead of myself).

Is this sufficient information ? Thank you a lot already for your time.

EDIT : @chinsoon12 thank you a lot, you are really good ! this seem to work like that(I actually didn't want NA when new = 0, so I changed your formula by removing new==1L and it does not work because the distance for all lines with new=0 is 1 because it calculates the time not with the last new == 1L but the distance with the last new==0L)

I have also however a small question : I used

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
    by=.EACHI, i.Month - x.Month]$V1], but is it possible to know which was the Code_ID_Sell for the line which serves as basis (in line "x" from what I understand)  for each i?
Nicolas
  • 43
  • 5
  • 1
    Even if you can't share your data, you can create a small randomly generated example of your data set that provides an idea of what needs to be done. – Phil Feb 28 '20 at 06:09
  • 1
    If you create a reproducible example, people can help you more easily. Check here to understand how to create an R reproducible example: https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Farzad Minooei Feb 28 '20 at 06:29
  • why is it the first few rows are not 1 since there are no identical code_buy and code sell before that – chinsoon12 Feb 28 '20 at 10:50
  • @Nicolas, fyi, pinging someone in the body of a post does not work, it needs to be in a comment. – chinsoon12 Feb 29 '20 at 00:10
  • @chinsoon12 thank you a lot for your solutions. I will be able to test the last one as soon as Thursay on my database but since it works on a small sample I would assume it works on the full database ! – Nicolas Mar 01 '20 at 19:38
  • @chinsoon12 I do not understand why but in my full database when using your second function Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month – Nicolas Mar 04 '20 at 13:29
  • Can you update your version of data.table? I am using 1.12.8 – chinsoon12 Mar 04 '20 at 23:00
  • @chinsoon12 I cannot update it after version 1.10.4-3 because it is on a computer with no internet connection and I need special rights to update the package :( – Nicolas Mar 05 '20 at 10:25
  • Then you need to replace .SD with copy(.SD) as suggested by the error message. Performance might degrade though but no choice – chinsoon12 Mar 05 '20 at 11:32
  • @chinsoon12 thank you a lot, you are really good ! this seem to work like that. I have however a small question : I used Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month – Nicolas Mar 06 '20 at 10:06
  • @chinsoon12 for sure, I just changed the code above including the variable distancewithlastr1 – Nicolas Mar 06 '20 at 12:30
  • @chinsoon12 the distance with the last time there was a new relationship. if you mean regarding the additional question, I updated the code with a new column ! Thank you a lot ! – Nicolas Mar 06 '20 at 12:50
  • @chinsoon12 yes Distancewithlastr1 is the new desired column ! – Nicolas Mar 06 '20 at 12:52
  • i dont understand the logic... *is it possible to know which was the Code_ID_Sell for the line which serves as basis (in line "x" from what I understand) for each i where distance is not na ?* – chinsoon12 Mar 06 '20 at 12:55
  • @chinsoon The logic of Distancewithlastr1 is to calculate the distance bewteen month i in line i and the last month where new=1, so if no new=1 is recorded, Distancewithlastr1=NA. If new=1 in month 3 and no new is recorded (for a given Buyer), then in month 4, Distancewithlastr1 is equal to 1, and Distancewithlastr1=2 in month. SInce there is a new=1 in month 5, in month 6, we start again and Distancewithlastr1=1... is it clearer ? – Nicolas Mar 06 '20 at 13:00
  • i dont understand the logic... is it possible to know which was the Code_ID_Sell for the line which serves as basis (in line "x" from what I understand) for each i where distance is not na ? – I added the column Lastrelationshipseller to illustrate – Nicolas Mar 06 '20 at 13:02
  • why are u not starting from 0 in month 6? – chinsoon12 Mar 06 '20 at 13:09
  • @chinsoon12 because even though there is a new relationship in month 6, the distance bewteen this new relationship and the last new relationship (in month 5) is =1, right ? – Nicolas Mar 06 '20 at 13:17

1 Answers1

0

An option is to order by Month, Code_ID_Buy and Code_ID_Sell and then set the first row in each group of Code_ID_Buy and Code_ID_Sell to 1:

library(data.table)
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), 
    new := +(rowid(Code_ID_Buy, Code_ID_Sell) == 1L)]

Your requirement basically translates into finding the first combi of Code_ID_Buy and Code_ID_Sell, here is another option using duplicated:

setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell),
    new := !duplicated(.SD), .SDcols=c("Code_ID_Buy", "Code_ID_Sell")]

output:

    Month    Amount Code_ID_Buy Code_ID_Sell new
 1:     1  87.47092        100D          98C   1
 2:     1 103.67287        100D          99C   1
 3:     2  83.28743        100D          98C   0
 4:     2 131.90562        100D          99C   0
 5:     3 106.59016        100D          98C   0
 6:     3  83.59063        100D          99C   0
 7:     3 109.74858        100D          96V   1
 8:     4 114.76649        100D          98C   0
 9:     4 111.51563        100D          99C   0
10:     4  93.89223        100D          96V   0
11:     5 130.23562        100D          98C   0
12:     5 107.79686        100D          99C   0
13:     5  87.57519        100D          96V   0
14:     5  55.70600        100D          94D   1
15:     6 122.49862        100D          98C   0
16:     6  99.10133        100D          99C   0
17:     6  99.67619        100D          96V   0
18:     6 118.87672        100D          94D   0
19:     3 116.42442        102D          25A   1
20:     4 111.87803        102D          25A   0
21:     5 118.37955        102D          25A   0
    Month    Amount Code_ID_Buy Code_ID_Sell new

This solution has 1 for the first few rows as well since there are no rows prior to them with the same Code_ID_Buy or Code_ID_Sell. If it is necessary to remove them, you can use Data[Month==1L, new := 0L]


For the 2nd question, you can use a non-equi join to find rows where Month is before current one and new=1L for the same Code_ID_Buy:

Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
    by=.EACHI, i.Month - x.Month]$V1]

output:

    Month    Amount Code_ID_Buy Code_ID_Sell new distance
 1:     1  87.47092        100D          98C   1       NA
 2:     1 103.67287        100D          99C   1       NA
 3:     2  83.28743        100D          98C   0       NA
 4:     2 131.90562        100D          99C   0       NA
 5:     3 106.59016        100D          98C   0       NA
 6:     3  83.59063        100D          99C   0       NA
 7:     3 109.74858        100D          96V   1        2
 8:     4 114.76649        100D          98C   0       NA
 9:     4 111.51563        100D          99C   0       NA
10:     4  93.89223        100D          96V   0       NA
11:     5 130.23562        100D          98C   0       NA
12:     5 107.79686        100D          99C   0       NA
13:     5  87.57519        100D          96V   0       NA
14:     5  55.70600        100D          94D   1        2
15:     6 122.49862        100D          98C   0       NA
16:     6  99.10133        100D          99C   0       NA
17:     6  99.67619        100D          96V   0       NA
18:     6 118.87672        100D          94D   0       NA
19:     3 116.42442        102D          25A   1       NA
20:     4 111.87803        102D          25A   0       NA
21:     5 118.37955        102D          25A   0       NA
    Month    Amount Code_ID_Buy Code_ID_Sell new distance

for your 3rd qn, you can try:

Data[, dlr := if (k>0L) rleid(Month) - 1L, .(Code_ID_Buy, k=cumsum(new))] 

output:

    Month    Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr
 1:     1  87.47092        100D          98C                 NA                    98C   0       NA  NA
 2:     1 103.67287        100D          99C                 NA                    98C   0       NA  NA
 3:     2  83.28743        100D          98C                 NA                    98C   0       NA  NA
 4:     2 131.90562        100D          99C                 NA                    98C   0       NA  NA
 5:     3 106.59016        100D          98C                 NA                    98C   0       NA  NA
 6:     3  83.59063        100D          99C                 NA                    98C   0       NA  NA
 7:     3 109.74858        100D          96V                  0                    98C   1       NA   0
 8:     4 114.76649        100D          98C                  1                    96V   0       NA   1
 9:     4 111.51563        100D          99C                  1                    96V   0       NA   1
10:     4  93.89223        100D          96V                  1                    96V   0       NA   1
11:     5 130.23562        100D          98C                  2                    96V   0       NA   2
12:     5 107.79686        100D          99C                  2                    96V   0       NA   2
13:     5  87.57519        100D          96V                  2                    96V   0       NA   2
14:     5  55.70600        100D          94D                  2                    96V   1        2   0
15:     6 122.49862        100D          98C                  1                    94D   0       NA   1
16:     6  99.10133        100D          99C                  1                    94D   0       NA   1
17:     6  99.67619        100D          96V                  1                    94D   0       NA   1
18:     6 118.87672        100D          94D                  1                    94D   0       NA   1
19:     3 116.42442        102D          25A                  0                   <NA>   1       NA   0
20:     4 111.87803        102D          25A                  1                    25A   0       NA   1
21:     5 118.37955        102D          25A                  2                    25A   0       NA   2
    Month    Amount Code_ID_Buy Code_ID_Sell Distancewithlastr1 LastRelationshipseller new distance dlr

p.s.: Welcome to SO. Just to let you know that it might be good ettiquete to post a new question rather than add a new question to a previous question and in addition, you might get others to respond as well.

chinsoon12
  • 25,005
  • 4
  • 25
  • 35