My question may be complicated, so please bear with me.

I am dealing with the following case: I have two financial time-series data sets from two exchanges (New York and London).

Both data sets look like the following:

London data set:

Date        time.second Price
2015-01-05  32417   238.2
2015-01-05  32418   238.2
2015-01-05  32421   238.2
2015-01-05  32422   238.2
2015-01-05  32423   238.2
2015-01-05  32425   238.2
2015-01-05  32427   238.2
2015-01-05  32431   238.2
2015-01-05  32435   238.47
2015-01-05  32436   238.47

New York data set:

NY.Date     Time    Price
2015-01-05  32416   1189.75
2015-01-05  32417   1189.665
2015-01-05  32418   1189.895
2015-01-05  32419   1190.15
2015-01-05  32420   1190.075
2015-01-05  32421   1190.01
2015-01-05  32422   1190.175
2015-01-05  32423   1190.12
2015-01-05  32424   1190.14
2015-01-05  32425   1190.205
2015-01-05  32426   1190.2
2015-01-05  32427   1190.33
2015-01-05  32428   1190.29
2015-01-05  32429   1190.28
2015-01-05  32430   1190.05
2015-01-05  32432   1190.04

As can be seen, there are 3 columns: Date, time (in seconds), and Price.

What I am trying to do is, using the London data set as a reference, find the row in the New York data set which is nearest but earlier in time.

What do I mean by "nearest but earlier"? For instance, given the row `"2015-01-01","21610","15.6871"` in the London data set, I want to find the row in the New York data set that is on the same date and whose time is nearest but earlier than or equal to the London time. It may help to look at my current program:

# I am trying to avoid using a for-loop, but so far I still need this one
temp <- matrix(0, nrow(london_data), 2)  # pre-allocate the result matrix
for (i in 1:nrow(london_data)) { # for each row in the London data set
    print(i)
    tempRow <- london_data[i, ]
    dateMatch <- which(NY_data[, 1] == tempRow[1]) # rows with the same date
    dataNeeded <- NY_data[dateMatch, ] # subset of the NY data for that date
    # find the NY rows that are earlier than or equal to the London time
    Found <- dataNeeded[which(dataNeeded[, 2] <= tempRow[2]), ]
    # Found may be more than one row, each row of length 3
    if (length(Found) > 3)
    {    # we only need "time" and "price", the 2nd and 3rd column,
         # taken from the final (latest) row of Found
         selected <- Found[dim(Found)[1], 2:3]
         if (length(selected) == 0) # if nothing was selected, insert 0 and 0
             temp[i, ] <- c(0, 0)
         else
             temp[i, ] <- selected
     }
     else { # Found is only one row, of length 3
         temp[i, ] <- Found[2:3] # just insert what we want
     }
   print(paste("time is", as.numeric(temp[i, 1]))) # monitor the loop
 }
 res <- cbind(london_data, temp)
 colnames(res) <- c("LondonDate", "LondonTime", "LondonPrice", "NYTime", "NYPrice")

The correct output for the data set listed above is **(only partially)**:

      "LondonDate","LondonTime","LondonPrice","NYTime","NYPrice"
 [1,] "2015-01-05" "32417"      "238.2"       "32417"    "1189.665" 
 [2,] "2015-01-05" "32418"      "238.2"       "32418"    "1189.895" 
 [3,] "2015-01-05" "32421"      "238.2"       "32421"    "1190.01"  
 [4,] "2015-01-05" "32422"      "238.2"       "32422"    "1190.175" 
 [5,] "2015-01-05" "32423"      "238.2"       "32423"    "1190.12"  
 [6,] "2015-01-05" "32425"      "238.2"       "32425"    "1190.205" 
 [7,] "2015-01-05" "32427"      "238.2"       "32427"    "1190.33"  
 [8,] "2015-01-05" "32431"      "238.2"       "32430"    "1190.05"  
 [9,] "2015-01-05" "32435"      "238.47"      "32432"    "1190.04"  
 [10,] "2015-01-05" "32436"      "238.47"      "32432"    "1190.04"

My problem is that the London data set has more than 5,000,000 rows. I tried to avoid the for-loop, but I still need at least one; the program above runs correctly, yet it takes about 24 hours.

How can I avoid the for-loop and accelerate the program?
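For reference, the per-date lookup itself can be vectorized in base R with `findInterval`. This is only a sketch with made-up sample vectors (a subset of the data above), and it assumes both time vectors are sorted in increasing order:

```r
# findInterval(x, v) returns, for each x, the index of the last v <= x
# (0 when every v is later) -- exactly "nearest but earlier or equal"
london_time <- c(32417, 32418, 32421, 32431, 32435)
ny_time     <- c(32416, 32417, 32418, 32430, 32432)
ny_price    <- c(1189.75, 1189.665, 1189.895, 1190.05, 1190.04)

idx <- findInterval(london_time, ny_time)
idx[idx == 0] <- NA                       # no earlier NY observation exists
matched <- data.frame(LondonTime = london_time,
                      NYTime     = ny_time[idx],
                      NYPrice    = ny_price[idx])
matched[is.na(matched)] <- 0              # same 0/0 fallback as in the loop
matched
```

Applied per date (e.g. via `split`), this avoids the row-by-row search, though a rolling join (below) is simpler still.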

Your kind help will be much appreciated.

GeekCat
    Look at the rolling joins, `dt1[dt2, roll=TRUE]`. Somebody surely will post an answer soon. Let us know about the timing you'll have using rolling joins. – jangorecki Apr 09 '15 at 07:06
  • please provide a small (10 rows) reproducible example with `?dput` for your 2 data sets – RockScience Apr 09 '15 at 07:16
  • @RockScience I put a small example in the question, please read it. – GeekCat Apr 09 '15 at 07:36
    @GeekCat Best practice is to use R function `dput` instead of pasting the data because that way people can load directly your data set with the exact same format (same date format, etc...) If you do that you are more likely to have people answering your question. See http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – RockScience Apr 09 '15 at 08:03

1 Answer


Building on @jangorecki's comment, here is a solution using data.table:

library(data.table)

df1 <- data.table(Date=rep("05/01/2015", 10),
              time.second=c(32417, 32418, 32421, 32422, 32423, 32425, 32427, 32431, 32435, 32436),
              Price=c(238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.47, 238.47))

df2 <- data.table(NY.Date=rep("05/01/2015", 16),
              Time=c(32416, 32417, 32418, 32419, 32420, 32421, 32422, 32423, 32424, 32425, 32426, 32427, 32428, 32429, 32430, 32432),
              Price=c(1189.75, 1189.665, 1189.895, 1190.15, 1190.075, 1190.01, 1190.175, 1190.12, 1190.14, 1190.205, 1190.2, 1190.33, 1190.29, 1190.28, 1190.05, 1190.04))

# rename df2's columns so both tables share the join columns
setnames(df2, c("Date", "time.second", "NYPrice"))

setkey(df1, "Date", "time.second")
setkey(df2, "Date", "time.second")

# keep a copy of the NY time so it survives the join
df2[, NYTime := time.second]

# rolling join: for each row of df1, take the df2 row with the
# nearest earlier (or equal) time.second on the same Date
df3 <- df2[df1, roll=TRUE]
df3
dimitris_ps
    thanks for your posting, it only runs for about 1 second rather than 24 hours, would you mind to explain more about it ? – GeekCat Apr 09 '15 at 09:01
    Great. There is an excellent blog post on rolling joins @ http://gormanalysis.com/r-data-table-rolling-joins/ – dimitris_ps Apr 09 '15 at 09:08
  • @GeekCat don't forget to mark question as answered. – jangorecki Apr 09 '15 at 18:58
  • @dimitris_ps I ran into a problem: I may not want an exact time match; I want the nearest strictly earlier data, not the same time. How can I do that? I tried shifting the data, but it does not seem to give the right results. – GeekCat Apr 10 '15 at 09:15
    @GeekCat you may try something like: `dt[, time.seconds := time.seconds * 10L]` on both datasets, and then `dt[, time.seconds := time.seconds - 1L]` only on one datasets, so you should never have equal values. – jangorecki Apr 10 '15 at 09:37
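The scaling trick in the last comment can be sketched as follows, using cut-down versions of the answer's `df1`/`df2`. Shifting the London side down by 1 after the ×10 scaling is one possible reading of the suggestion; it guarantees that an equal timestamp can never match, so the roll always lands on a strictly earlier NY row:

```r
library(data.table)

# Cut-down versions of the answer's London (df1) and NY (df2) tables
df1 <- data.table(Date = "05/01/2015",
                  time.second = c(32417, 32421, 32431),
                  Price = c(238.2, 238.2, 238.2))
df2 <- data.table(Date = "05/01/2015",
                  time.second = c(32416, 32417, 32421, 32430),
                  NYPrice = c(1189.75, 1189.665, 1190.01, 1190.05))

# Scale both time columns by 10, then shift the London times down by 1:
# a NY time equal to a London time becomes 10*t > 10*t - 1, so the
# rolling join can only pick a strictly earlier NY row
df1[, time.second := time.second * 10 - 1]
df2[, time.second := time.second * 10]
df2[, NYTime := time.second / 10]             # keep the original NY time

setkey(df1, Date, time.second)
setkey(df2, Date, time.second)

df3 <- df2[df1, roll = TRUE]
df3[, LondonTime := (time.second + 1) / 10]   # recover the original London time
df3
```

Here London 32417 matches NY 32416 (not the equal-time NY 32417), which is the "strictly earlier" behaviour asked about in the comments.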