I have the following issue and I hope you may help me: I have a huge database (which I cannot disclose) but is it is structured as follows:
- 5 million observations
- 7 variables of which three of interest in this case:
- Code ID Buy
- Code ID Sell
- Date
I would like another variable called new
, which takes the value 0
in line i
if:
- there exists an observation
k
, withCode_IB_Buy_[i]=Code_IB_Buy_[k]
andCode_IB_Sell_[i]=Code_IB_Sell_[k]
andDate[i]
is afterDate[k]
if not, I would likenew[i]=1
.
Basically if there was no transaction before between the buyer and the seller, this is the first occurrence in line i
so the variable new
takes the value 1
.
What I have tried until now, is to sort the database (as a data.table) by date and by buyer. Then I use a For Loop, which would work in other languages but here it does not for some reason.
EDIT: A reduced sample would look like this:
library(data.table)
set.seed(1)
Data <- data.frame(
Month = c(1,1,2,2,3,3,3,4,4,4,5,5,5,5,6,6,6,6,3,4,5),
Amount <- rnorm(21,mean=100,sd=20),
Code_ID_Buy = c("100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","100D","102D","102D","102D"),
Code_ID_Sell = c("98C","99C","98C","99C","98C","99C","96V","98C","99C","96V","98C","99C","96V","94D","98C","99C","96V","94D","25A","25A","25A"),
Distancewithlastr1 = c(NA,NA,NA,NA,NA,NA,NA,1,1,1,2,2,2,2,1,1,1,1,0,1,2),
LastRelationshipseller = c("98C","98C","98C","98C","98C","98C","98C","96V","96V","96V","96V","96V","96V","96V","94D","94D","94D","94D",NA,"25A","25A")
)
Data$new<-0
setDT(Data)[order(Month, Code_ID_Buy, Code_ID_Sell), new := {
r <- rowid(Code_ID_Buy, Code_ID_Sell)
+(r==1L)
}]
Data[Month==1L, new:=0L]
View(Data)
The data is already ordered since I used (and it works fine) : Data<-Data[order(month),] Data<-Data[order(Code_ID_Buy),]
However, what I would like to do now is to introduce the new column "new" that takes value 1 if and if only R does not find any observation where ID_Buy and ID_Sell are the same before the given date (new customer relationship)
In the example above, line 7,14 and 19 should be marked with 1 but none of the others.
In a second step, I would like to calculate the number of months between the first and second relationship (a column "distance", which would take the value "month of new relationship"-"month of last new relationship" else, that is in line 14, it would take the value "2",5-3, but this is getting ahead of myself).
Is this sufficient information ? Thank you a lot already for your time.
EDIT : @chinsoon12 thank you a lot, you are really good ! this seem to work like that(I actually didn't want NA when new = 0, so I changed your formula by removing new==1L and it does not work because the distance for all lines with new=0 is 1 because it calculates the time not with the last new == 1L but the distance with the last new==0L)
I have also however a small question : I used
Data[new==1L, distance := .SD[.SD, on=.(Code_ID_Buy, Month<Month), mult="last",
by=.EACHI, i.Month - x.Month]$V1], but is it possible to know which was the Code_ID_Sell for the line which serves as basis (in line "x" from what I understand) for each i?