1

this is my first question here so please potentially excuse some beginners mistakes in asking if any.

My problem in R is the following:

  • I have a large dataset (data.table object data_op) with appx. 4.5 million observations over 15 years, which means that I have several data points on each day. The relevant entries each day are (option) strike price and the current date.
  • Additionally, I have a second object, data_sec, containing one (closing) price for each of those days, i.e. the vector is of course much smaller but covers the same range.

The relevant part of the data sets look like this: data_op:

date
04/01/1996
04/01/1996
04/01/1996
...
29/08/2014
29/08/2014

data_sec:

date            close
04/01/1996      617.70
05/01/1996      616.71
08/01/1996      618.46
...             ...
28/08/2014      1996.74
29/08/2014      2003.37

Now, for further calculations, I need to read the closing price from the second vector for each of the dates in the first. Unfortunately, something like

data_op$new <- data_sec$close[data_sec$date == data_op$date]

does not work, since R correctly tells me that 'longer object length is not a multiple of shorter object length'. Therefore, I use the for-loop workaround:

for (i in 1:length(data_op$date)){
data_op$new[i] <- data_sec$close[data_sec$date == data_op$date[i]]}

However, this takes extremely long to execute (I actually have to manually stop execution sometimes). So my question is if there is a faster/more efficient way in achieving this. I thought this should be quite straight-forward but could not find a post about such vector indexing without a loop.

I hope I made my problem clear. Any help is very much appreciated!

Thanks!

Community
  • 1
  • 1
Daedalus
  • 235
  • 1
  • 2
  • 9
  • please share data to illustrate your problem – mtoto May 01 '17 at 08:39
  • The data set is about 2 GB big, but I can outline it a bit more if it was not clear enough originally. One moment – Daedalus May 01 '17 at 08:42
  • You only need to share data big enough to illustrate your problem, not the whole thing. – mtoto May 01 '17 at 08:43
  • I outlined it above, is that sufficient? – Daedalus May 01 '17 at 08:48
  • what is your expected output? you probably can just use `merge()`. – mtoto May 01 '17 at 08:49
  • the closing prices of data_sec for each of the data_op dates, i.e. with the data above: [617.7, 617.7, 617.7 ... 2003.37, 2003.37] which is achieved by the for loop, but the execution time is quite a problem. – Daedalus May 01 '17 at 08:52
  • Unfortunately, merge() does not seem to solve my problem, I would need a combination of merge() and rep() for the number of observations in data_op per date, but unfortunately, this is not constant. – Daedalus May 01 '17 at 09:08
  • try `merge(data_op, data_sec, all.x = T, by = "date")` – mtoto May 01 '17 at 09:11
  • Thanks @mtoto, was very interesting to learn about the merge() function. Whilst reading about it however, I saw that `data_sec$close[match(data_op$date, data_sec$date)]` is more suitable for my purposes. I appreciate your help nevertheless! – Daedalus May 01 '17 at 09:59

0 Answers0