0

I have two dataframes in pandas. One is testdata and the other refdata. Both dataframes have a column named 'timestamp'. like this:

timestamp
197888.000,
197888.100,
197888.200,
197888.300,
197888.400,
...

I want to find out the proper reference data index from refdata for the each row of the test data. I use the 'apply' method for this.

index = testdata.apply(lambda x: locate_refindex(x.timestamp, refdata, threshold), axis=1) 

in the function locate_refindex, the very key algorithm is the statement as below:

def locate_refindex(timestamp, refdata, threshold):
    ...
    slice_data = refdata[abs(refdata['timestamp'] - timestamp) <= threshold]
    ...

which finds out the nearest refdata row(s) to the test data. for instance, if I have the testdata row with the timestamp 197888.100, the threshold 0.099, then the slice_data should be two items.

The problem I met is the this algorithm is very slow, about 1300ns. if the test data is big enough up to million rows, the algorithm is quite bad.

I wonder if there are some other ways to make this faster. Thank you very much.

-Stefan

stefanzweig
  • 307
  • 2
  • 9
  • could you show and example of your dataframe's and what you expect? – ansev Oct 06 '19 at 14:17
  • In my experience `for row in df.iterrows()` is the fastest way to iterate over the whole dataframe. But you should test it. Also see here: https://stackoverflow.com/questions/7837722/what-is-the-most-efficient-way-to-loop-through-dataframes-with-pandas – PySeeker Oct 06 '19 at 14:28

0 Answers0