I have a large dataset (3667856 rows x 20 columns), and running the following code on it produces the warning shown below:
library(data.table)
library(zoo)
data[, new_quant_PD := na.locf(QUANT_PD,na.rm=FALSE), by=c('OBLIGOR_ID','PORTFOLIO','OBLIGATION_NUMBER')]
Warning messages:
1: In `[.data.table`(data, , `:=`(new_quant_PD, na.locf(QUANT_PD, ... :
Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. Also, in R<=v3.0.2, list(DT1,DT2) copied the entire DT1 and DT2 (R's list() used to copy named objects); please upgrade to R>v3.0.2 if that is biting. If this message doesn't help, please report to datatable-help so the root cause can be fixed.
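For reference, the set* functions that the warning recommends can be sketched as follows. This is only a minimal illustration on a hypothetical table: dt, its columns, and the attribute name are made up and are not taken from my real data.

```r
library(data.table)

# Hypothetical stand-in table; names are illustrative only.
dt <- data.table(id = c(1L, 1L, 2L), value = c("a", NA, "b"))

# By-reference alternatives to names<-, key<- and attr<-,
# which the warning says may copy the whole data.table.
setnames(dt, "value", "score")   # instead of names(dt)[2] <- "score"
setkey(dt, id)                   # instead of key(dt) <- "id"
setattr(dt, "note", "example")   # instead of attr(dt, "note") <- "example"

# := then adds a column by reference on the same object.
dt[, flag := is.na(score)]
```

Since none of these calls reassigns dt, the object that := modifies is the same one data.table originally allocated.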
In order to understand the situation better, I created the following simpler (yet similar) example:
tmp = data.table(name=c('Zhao','Zhao','Zhao','Qian','Qian','Sun','Sun','Li','Li','Li'),score=c('B+',NA,'B',NA,NA,NA,'A',NA,'A-',NA))
tmp
    name score
 1: Zhao    B+
 2: Zhao    NA
 3: Zhao     B
 4: Qian    NA
 5: Qian    NA
 6:  Sun    NA
 7:  Sun     A
 8:   Li    NA
 9:   Li    A-
10:   Li    NA
tmp[,new_score:=na.locf(score,na.rm=FALSE),by='name']
tmp
    name score new_score
 1: Zhao    B+        B+
 2: Zhao    NA        B+
 3: Zhao     B         B
 4: Qian    NA        NA
 5: Qian    NA        NA
 6:  Sun    NA        NA
 7:  Sun     A         A
 8:   Li    NA        NA
 9:   Li    A-        A-
10:   Li    NA        A-
This smaller example does not generate a warning message at all.
In theory I could loop over all combinations of OBLIGOR_ID, PORTFOLIO, and OBLIGATION_NUMBER to find out which one(s) cause the trouble, but data is only a subset of an 81293658-row dataset that I have. I don't think I can afford that much loop time in R.
Any suggestion is greatly appreciated!
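One check I can sketch without looping over groups, assuming the warning concerns the table object as a whole rather than any particular group (that is my reading of the message, not something I have confirmed): compare truelength() with the column count, then re-allocate with setDT() before using :=. The tmp table below is a shortened version of the example above, and is_overallocated is just an illustrative name. A small truelength() is only one sign that the table was rebuilt outside data.table, not a full test of the self-reference.

```r
library(data.table)

# Shortened version of the example table, standing in for the real `data`.
tmp <- data.table(name = c("Zhao", "Zhao", "Qian"), score = c("B+", NA, NA))

# A freshly created data.table is over-allocated, so truelength()
# exceeds the number of columns; a table copied or rebuilt by base R may not be.
is_overallocated <- truelength(tmp) > length(tmp)

# setDT() works on the existing object and may restore the allocation;
# copy(tmp) would be the alternative if a fresh deep copy is acceptable.
setDT(tmp)
tmp[, is_missing := is.na(score)]
```

If this works, it is a single table-level operation, so its cost should not depend on the number of groups.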