I have data as follows:
PERMNO  date  DLSTCD
    10  1983      NA
    10  1985     250
    10  1986      NA
    10  1986      NA
    10  1987     240
    10  1987      NA
    11  1984      NA
    11  1984      NA
    11  1985      NA
    11  1987      NA
    12  1984     240
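For reproducibility, here is the toy data above built directly as a data.table (the object name dt matches how I use it below):

library(data.table)

dt <- data.table(
  PERMNO = c(10, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12),
  date   = c(1983, 1985, 1986, 1986, 1987, 1987, 1984, 1984, 1985, 1987, 1984),
  DLSTCD = c(NA, 250, NA, NA, 240, NA, NA, NA, NA, NA, 240)
)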
I need to filter rows based on the following criteria:
- For each PERMNO, sort the data by date
- Walk through the sorted data and drop every row after a company gets delisted (i.e. after the first row with DLSTCD != NA); the delisting row itself is kept
- If the very first row corresponds to the company getting delisted, do not include any rows for that company (see the sketch after the expected output below)
Based on these criteria, the following is my expected output:
PERMNO  date  DLSTCD
    10  1983      NA
    10  1985     250
    11  1984      NA
    11  1984      NA
    11  1985      NA
    11  1987      NA
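To make the logic concrete, here is a naive per-group sketch that reproduces the expected output on the toy data (written only to illustrate the rule; it is not one of my timed implementations):

setkey(dt, PERMNO, date)                   # sort by PERMNO, then date
dt[, {
  idx <- which(!is.na(DLSTCD))[1L]         # position of the first delisting, NA if none
  if (is.na(idx))     .SD                  # never delisted: keep every row
  else if (idx == 1L) .SD[0L]              # delisted on the first row: drop the company
  else                .SD[seq_len(idx)]    # keep rows up to and including the delisting
}, by = PERMNO]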
I am using data.table in R to work with this data. The example above is an oversimplified version of my actual data, which contains around 3M rows corresponding to about 30k PERMNOs.
I implemented three different methods for doing this, as can be seen here:
r-fiddle: http://www.r-fiddle.org/#/fiddle?id=4GapqSbX&version=3
I compared my implementations using a smaller dataset of 50k rows; here are my results:
Time Comparison
system.time(dt <- filterbydelistingcode(dt)) # 39.962 seconds
system.time(dt <- filterbydelistcoderowindices(dt)) # 39.014 seconds
system.time(dt <- filterbydelistcodeinline(dt)) # 114.3 seconds
As you can see, all my implementations are extremely inefficient. Can someone please help me implement a much faster version? Thank you.
Edit: Here is a link to a sample dataset of 50k rows that I used for time comparison: https://ufile.io/q9d8u
Also, here is a custom read function for this data:
readdata = function(filename){
  data = read.csv(filename, header = TRUE, colClasses = c(date = "Date"))
  PRCABS = abs(data$PRC)              # absolute price
  mcap = PRCABS * data$SHROUT         # market cap = |price| * shares outstanding
  hpr = data$RET                      # RET is read in as a factor
  HPR = as.numeric(levels(hpr))[hpr]  # factor -> numeric; non-numeric levels such as "" become NA
  data = cbind(data, PRCABS, mcap, HPR)
  return(data)
}
data <- readdata('fewdata.csv')
dt <- as.data.table(data)