I am merging two files on exactly matching timestamps. These are field data. I am trying to find fastest way of merging them accurately. I am listing here my trial and error results. I would like to know any better and fast method there. Remember, file1 has around 20 million rows, and file2 has 2 million. You guessed it correctly, I am not and cannot share them here. My code:
file1.shape
Out[13]: (19069591, 11) # 19.1 million rows
file2.shape
Out[14]: (1987321, 44) # 1.9 million rows
Method1: pd.merge (Correct result but slower)
%timeit df = pd.merge(file1,file2,how='inner',left_index=True,right_index=True)
28.3 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pd.merge(file1,file2,how='inner',left_index=True,right_index=True)
df.shape
Out[17]: (1776798, 55) # 1.7 million rows
Method2: pd.merge_asof(4 times faster but wrong result)
This method needs sorting of both files. Hence, two additional lines of code. But these two lines executed with in 2 or 3 s, not counted towards the %timeit
of this method
file1.sort_index(inplace=True,ascending=True)
file2.sort_index(inplace=True,ascending=True)
%timeit pd.merge_asof(file1,file2,left_index=True,right_index=True,direction='nearest',tolerance=pd.Timedelta('0s'))
8.72 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
df = pd.merge_asof(file1,file2,left_index=True,right_index=True,direction='nearest',tolerance=pd.Timedelta('0s'))
df.shape
(19069591, 55) # 19 million rows (NOT CORRECT)
Method2 is definitely faster compared to Method1. But there is an issue. Method1 result is correct. Because its resultant df
size 1.7 million rows is lesser in size compared to file1 (1.9) and file2 (19). Though Method2 is faster but result is wrong as it considers all rows from the both data frames. Two questions here: 1. How to get exact matches out of the Method2 approach? 2. Is there any faster method than these two?
Solution updates Based on pavel
suggestion:
%timeit df = pd.merge_asof(file1,file2,left_index=True,right_index=True,allow_exact_matches=True,direction='nearest')
10.7 s ± 758 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
df.shape
Out[26]: (19069591, 55)