I have just started using R and wrote the following code, but it takes about 40 minutes to run, so I am sure it can be written in a way that runs faster.
Basically, I have one large data set (crsp_td_net) of about 7 GB in size and a second, smaller data set (ff_35f). Both contain trading dates. What I am trying to do is fill in the missing trading dates for each company in the first data set.
From the first data set, I create subsets of data based on a company index and merge each subset with the second data set on trading dates. Each merged subset is appended to the previous companies' data, and so on, until at the end I am left with one large data set containing all the original companies but with the missing trading days incorporated.
I am not sure at this point whether the loop runs slowly because the data frame final keeps expanding, or because the loop itself is coded inefficiently. I understand that vectorising the operation could speed this up, but I am not sure how to do that here (the size of each company's subset differs). I am also not sure of the best way to use apply, sapply or lapply (if any of these can be used here). I have browsed a few similar questions on R but have not found a way to do this. I would very much appreciate an alternative snippet of code that makes the code below run faster.
todo <- data.frame()
for (i in 1:7396) {
  final <- crsp_td_net %>%
    filter(compid == i) %>%
    merge(ff_35f, by = "date_crsp", all = TRUE)
  final <- final %>%
    filter(between(date_crsp,
                   as.Date(min(date_crsp_orig, na.rm = TRUE)),
                   as.Date(max(date_crsp_orig, na.rm = TRUE)))) %>%
    arrange(date_crsp) %>%
    mutate(cusip8dg_compustat =
             ifelse(is.na(cusip8dg_compustat),
                    max(cusip8dg_compustat, na.rm = TRUE),
                    cusip8dg_compustat)) %>%
    mutate(compid = ifelse(is.na(compid), i, compid)) %>%
    select(compid, cusip8dg_compustat, date_crsp, date_crsp_orig) %>%
    distinct()
  todo <- bind_rows(todo, final)
}
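For what it is worth, here is a loop-free sketch of the same fill-in logic: compute each company's first and last trading date once, cross the company list with the full trading calendar, keep only the dates inside each company's own window, and join the original columns back on. The column names follow the loop above; the tiny data frames here are mock stand-ins, since the real 7 GB data set is not reproducible.

```r
library(dplyr)
library(tidyr)

# Mock stand-ins for the real data (same column layout as the loop above)
crsp_td_net <- data.frame(compid    = c(1, 1, 2, 2),
                          date_crsp = as.Date(c("2020-01-02", "2020-01-06",
                                                "2020-01-03", "2020-01-07")))
ff_35f <- data.frame(date_crsp = as.Date("2020-01-01") + 0:7)

todo <- crsp_td_net %>%
  group_by(compid) %>%
  summarise(first_day = min(date_crsp), last_day = max(date_crsp)) %>%
  crossing(date_crsp = ff_35f$date_crsp) %>%          # every compid x calendar day
  filter(date_crsp >= first_day, date_crsp <= last_day) %>%  # each firm's own window
  select(compid, date_crsp) %>%
  left_join(crsp_td_net, by = c("compid", "date_crsp")) %>%  # bring back other columns
  arrange(compid, date_crsp)
```

The per-row grow-and-append pattern of the loop is replaced by one grouped summary, one cross join, and one filter, which dplyr evaluates over whole columns at once.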
Thanks in advance,
Dev
Thank you all for your responses. I was unable to reply in the comment box due to the limit on response length, so I am adding to my original post.
@P Lapointe: please find a reproducible data set below (I have used integer values instead of actual dates).
@eipi10: I think you have understood what I am after, and thanks for the code, but I am not sure if it is missing something, as it prompts for an input (I have all the relevant libraries loaded).
@Alistaire: I will indeed face memory problems as I perform more calculations to add to the original data set. I would be grateful for your suggestions on how to make the loop faster, or an alternative to it; it would be very helpful to see how they would be implemented in the example below.
Many thanks
zz <- "compid date_crsp
1 1 2
2 1 3
3 1 5
4 2 3
5 2 7
6 2 9
7 3 3
8 3 5
9 3 7
10 3 8"
crsp_td_net <- read.table(text=zz, header = TRUE)
xx <- "date_crsp
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
11 11"
ff_35f <- read.table(text=xx, header = TRUE)
# I expect my final output to look like this:
yy<-"compid date_crsp
1 1 2
2 1 3
3 1 4
4 1 5
5 2 3
6 2 4
7 2 5
8 2 6
9 2 7
10 2 8
11 2 9
12 3 3
13 3 4
14 3 5
15 3 6
16 3 7
17 3 8"
output_wanted<-read.table(text=yy, header = TRUE)
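One possible loop-free version, as a sketch: tidyr's complete() fills in missing combinations within each group, so grouping by compid and completing date_crsp against the calendar in ff_35f (restricted to each company's own range) reproduces output_wanted from the toy data above. This assumes dplyr and tidyr; the toy data is repeated here so the snippet runs on its own.

```r
library(dplyr)
library(tidyr)

# Same toy data as above
crsp_td_net <- data.frame(compid    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                          date_crsp = c(2, 3, 5, 3, 7, 9, 3, 5, 7, 8))
ff_35f <- data.frame(date_crsp = 1:11)

todo <- crsp_td_net %>%
  group_by(compid) %>%
  # complete() works per group: min()/max() see only this compid's dates
  complete(date_crsp = ff_35f$date_crsp[ff_35f$date_crsp >= min(date_crsp) &
                                        ff_35f$date_crsp <= max(date_crsp)]) %>%
  ungroup() %>%
  arrange(compid, date_crsp)
```

Because everything happens in one grouped pass over the data, there is no per-company subsetting and no growing data frame.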
df <- full_join(crsp_td_net,
                expand.grid(compid = unique(crsp_td_net$compid),
                            date_crsp = unique(ff_35f$date_crsp)))
todo <- data.frame()
for (i in 1:3) {
  final <- filter(crsp_td_net, compid == i)
  final <- mutate(final, date_crsp_orig = date_crsp)
  final <- merge(final, ff_35f, by = "date_crsp", all = TRUE)
  final <- filter(final, between(date_crsp, min(date_crsp_orig, na.rm = TRUE),
                                 max(date_crsp_orig, na.rm = TRUE)))
  final <- arrange(final, date_crsp)
  final <- mutate(final, compid = ifelse(is.na(compid), i, compid))
  final <- select(final, compid, date_crsp)
  final <- distinct(final)
  todo <- bind_rows(todo, final)
}
I have modified the full_join example and it now runs, but it is not doing what I want, namely merging each compid with the unique trading days to fill in the missing trading days in the first data set. I would very much appreciate any suggestions on this.
The loop I wrote above gives me exactly what I want, but I will have to loop over 7000 or so compid values to create the large data set todo, which takes about 40 minutes, so I wonder whether there is a faster way to write the loop or an alternative to it.
Many thanks in advance
crsp_td_net$date_crsp_orig <- crsp_td_net$date_crsp
df <- full_join(crsp_td_net,
                expand.grid(compid = unique(crsp_td_net$compid),
                            date_crsp = unique(ff_35f$date_crsp)),
                by = "date_crsp")
df <- df %>% filter(between(date_crsp, min(date_crsp_orig, na.rm = TRUE),
                            max(date_crsp_orig, na.rm = TRUE)))
df <- df %>% filter(!is.na(compid.x)) %>% select(-compid.y) %>%
  distinct() %>% arrange(compid.x, date_crsp)
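If it helps diagnose the problem: one likely reason this full_join version drifts from the loop is that min() and max() in the between() filter are computed over the whole data frame rather than per compid. Grouping by compid before filtering, and joining on both keys so no compid.x/compid.y suffixes appear, gives the wanted output. A sketch against the toy data above (repeated here so the snippet runs on its own):

```r
library(dplyr)

# Same toy data as above
crsp_td_net <- data.frame(compid    = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
                          date_crsp = c(2, 3, 5, 3, 7, 9, 3, 5, 7, 8))
ff_35f <- data.frame(date_crsp = 1:11)

crsp_td_net$date_crsp_orig <- crsp_td_net$date_crsp
grid <- expand.grid(compid    = unique(crsp_td_net$compid),
                    date_crsp = unique(ff_35f$date_crsp))

# Join on both keys so compid stays a single column
df <- full_join(crsp_td_net, grid, by = c("compid", "date_crsp")) %>%
  group_by(compid) %>%                      # the date range is now per company
  filter(date_crsp >= min(date_crsp_orig, na.rm = TRUE),
         date_crsp <= max(date_crsp_orig, na.rm = TRUE)) %>%
  ungroup() %>%
  select(compid, date_crsp) %>%
  arrange(compid, date_crsp)
```

The expand.grid/full_join still materialises every compid-by-date combination before filtering, so with 7000+ companies the grouped complete() style may be lighter on memory.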