I have two dataframes in pandas. One is testdata and the other refdata. Both dataframes have a column named 'timestamp'. like this:
timestamp
197888.000,
197888.100,
197888.200,
197888.300,
197888.400,
...
I want to find out the proper reference data index from refdata for the each row of the test data. I use the 'apply' method for this.
index = testdata.apply(lambda x: locate_refindex(x.timestamp, refdata, threshold), axis=1)
in the function locate_refindex, the very key algorithm is the statement as below:
def locate_refindex(timestamp, refdata, threshold):
...
slice_data = refdata[abs(refdata['timestamp'] - timestamp) <= threshold]
...
which finds out the nearest refdata row(s) to the test data. for instance, if I have the testdata row with the timestamp 197888.100, the threshold 0.099, then the slice_data should be two items.
The problem I met is the this algorithm is very slow, about 1300ns. if the test data is big enough up to million rows, the algorithm is quite bad.
I wonder if there are some other ways to make this faster. Thank you very much.
-Stefan