My question is a bit long, so please bear with me.
I have two financial time series data sets from 2 exchanges (New York and London).
Both data sets look like the following:
London data set:
Date time.second Price
2015-01-05 32417 238.2
2015-01-05 32418 238.2
2015-01-05 32421 238.2
2015-01-05 32422 238.2
2015-01-05 32423 238.2
2015-01-05 32425 238.2
2015-01-05 32427 238.2
2015-01-05 32431 238.2
2015-01-05 32435 238.47
2015-01-05 32436 238.47
New York data set:
NY.Date Time Price
2015-01-05 32416 1189.75
2015-01-05 32417 1189.665
2015-01-05 32418 1189.895
2015-01-05 32419 1190.15
2015-01-05 32420 1190.075
2015-01-05 32421 1190.01
2015-01-05 32422 1190.175
2015-01-05 32423 1190.12
2015-01-05 32424 1190.14
2015-01-05 32425 1190.205
2015-01-05 32426 1190.2
2015-01-05 32427 1190.33
2015-01-05 32428 1190.29
2015-01-05 32429 1190.28
2015-01-05 32430 1190.05
2015-01-05 32432 1190.04
As can be seen, each data set has 3 columns: date, time (second), and price.
Using the London data set as a reference, I want to find, for each London row, the row in the New York data set that is nearest in time but not later.
What do I mean by "nearest but earlier"? For instance, for the row
"2015-01-01","21610","15.6871" in the London data set, I want to find the row in the New York data set that is on the same date and has the nearest time that is earlier than or equal to the London time. It may help to look at my current program:
# I am trying to avoid using for-loop
for(i in 1:dim(london_data)[1]){ #for each row in london data set
print(i)
tempRow<-london_data[i,]
dateMatch<-(which(NY_data[,1]==tempRow[1])) # select the same date
dataNeeded<-(NY_data[dateMatch,]) # subset the NY rows on the same date
# find the nearest but earlier data in NY_data set
Found<-dataNeeded[which(dataNeeded[,2]<=tempRow[2]),]
# Found may be more than one row, each row is of length 3
if(length(Found)>3)
{ # Select the data, we only need "time" and "price", 2nd and 3rd
# column
# the data is in the final row of **Found**
selected<-Found[dim(Found)[1],2:3]
if(length(selected)==0) # if nothing selected, just insert 0 and 0
temp[i,]<-c(0,0)
else
temp[i,]<-selected
}
else{ # Found may be only one row, of length 3
temp[i,]<-Found[2:3] # just insert what we want
}
print(paste("time is", as.numeric(selected[1]))) #Monitor the loop
}
res<-cbind(london_data,temp)
colnames(res)<-c("LondonDate","LondonTime","LondonPrice","NYTime","NYPrice")
The correct output for the data set listed above is (**shown only partially**):
"LondonDate","LondonTime","LondonPrice","NYTime","NYPrice"
[1,] "2015-01-05" "32417" "238.2" "32417" "1189.665"
[2,] "2015-01-05" "32418" "238.2" "32418" "1189.895"
[3,] "2015-01-05" "32421" "238.2" "32421" "1190.01"
[4,] "2015-01-05" "32422" "238.2" "32422" "1190.175"
[5,] "2015-01-05" "32423" "238.2" "32423" "1190.12"
[6,] "2015-01-05" "32425" "238.2" "32425" "1190.205"
[7,] "2015-01-05" "32427" "238.2" "32427" "1190.33"
[8,] "2015-01-05" "32431" "238.2" "32430" "1190.05"
[9,] "2015-01-05" "32435" "238.47" "32432" "1190.04"
[10,] "2015-01-05" "32436" "238.47" "32432" "1190.04"
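To make the expected output above reproducible, here is a minimal sketch of the matching rule on the sample rows only. It is not my real program: the data frame names `london` and `ny`, the column names, and the helper `key` are made up for illustration, and it assumes the time column is a second of the day (below 86400). It uses base R's `findInterval`, which returns, for each value, the index of the last element of a sorted vector that is less than or equal to it. As in my loop, rows without a match get 0.

```r
# Sample rows copied from the tables above (names are illustrative only).
london <- data.frame(
  Date  = as.Date("2015-01-05"),
  Time  = c(32417, 32418, 32421, 32422, 32423, 32425, 32427, 32431, 32435, 32436),
  Price = c(238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.2, 238.47, 238.47)
)
ny <- data.frame(
  Date  = as.Date("2015-01-05"),
  Time  = c(32416, 32417, 32418, 32419, 32420, 32421, 32422, 32423, 32424,
            32425, 32426, 32427, 32428, 32429, 32430, 32432),
  Price = c(1189.75, 1189.665, 1189.895, 1190.15, 1190.075, 1190.01, 1190.175,
            1190.12, 1190.14, 1190.205, 1190.2, 1190.33, 1190.29, 1190.28,
            1190.05, 1190.04)
)

# One sortable number per row: days since the epoch times 86400, plus the second.
# Assumption: Time is a second of the day, so keys never overlap across dates.
key <- function(d) as.numeric(d$Date) * 86400 + d$Time

# findInterval needs the reference vector in non-decreasing order.
ny <- ny[order(key(ny)), ]

# For each London row: index of the last NY row whose key is <= the London key
# (0 when every NY row is later).
idx <- findInterval(key(london), key(ny))
idx[idx == 0] <- NA

# A match only counts if it is on the same date as the London row.
ok <- !is.na(idx) & ny$Date[idx] == london$Date

res <- data.frame(
  LondonDate  = london$Date,
  LondonTime  = london$Time,
  LondonPrice = london$Price,
  NYTime      = ifelse(ok, ny$Time[idx], 0),   # 0 when nothing was found
  NYPrice     = ifelse(ok, ny$Price[idx], 0)
)
print(res)
```

On the sample rows this gives the same `NYTime` and `NYPrice` values as the table above. I have not checked how it behaves on the full data.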
My problem is that the London data set has more than 5,000,000 rows. I tried to avoid the for-loop, but my program still needs at least one. The program above runs successfully, but it took about 24 hours.
How can I avoid the for-loop and speed up the program?
Any help would be greatly appreciated.