
I am merging two files on exactly matching timestamps. These are field data. I am trying to find the fastest way to merge them accurately, so I am listing my trial-and-error results here; I would like to know of any better, faster method. Note that file1 has around 19 million rows and file2 around 2 million. You guessed correctly: I cannot and will not share them here. My code:

file1.shape
Out[13]: (19069591, 11) # 19.1 million rows

file2.shape
Out[14]: (1987321, 44) # 1.9 million rows
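
For a minimal reproducible example (as requested in the comments below), dummy frames of the same shape can be generated along these lines; the column names, timestamp frequency, and values are all made up:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# file1: ~19.1 million rows, 11 columns, on a 1-second DatetimeIndex
idx1 = pd.date_range('2021-01-01', periods=19_069_591, freq='S')
file1 = pd.DataFrame(rng.normal(size=(len(idx1), 11)),
                     index=idx1, columns=[f'a{i}' for i in range(11)])

# file2: ~1.99 million rows, 44 columns, indexed by a random subset of
# file1's timestamps so that only some rows match exactly
pos = np.sort(rng.choice(len(idx1), size=1_987_321, replace=False))
file2 = pd.DataFrame(rng.normal(size=(len(pos), 44)),
                     index=idx1[pos], columns=[f'b{i}' for i in range(44)])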

Method1: pd.merge (Correct result but slower)

%timeit df = pd.merge(file1,file2,how='inner',left_index=True,right_index=True)
28.3 s ± 2.19 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

df = pd.merge(file1,file2,how='inner',left_index=True,right_index=True)  
df.shape
Out[17]: (1776798, 55) # 1.7 million rows

Method2: pd.merge_asof (4 times faster but wrong result). This method requires both files to be sorted, hence two additional lines of code. Those two lines executed within 2 or 3 s and are not counted towards the %timeit of this method.

file1.sort_index(inplace=True,ascending=True)
file2.sort_index(inplace=True,ascending=True)

%timeit pd.merge_asof(file1,file2,left_index=True,right_index=True,direction='nearest',tolerance=pd.Timedelta('0s'))

8.72 s ± 1.05 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

df = pd.merge_asof(file1,file2,left_index=True,right_index=True,direction='nearest',tolerance=pd.Timedelta('0s'))
df.shape
(19069591, 55) # 19 million rows (NOT CORRECT)

Method2 is definitely faster than Method1, but there is an issue. Method1's result is correct: its resulting df, at 1.7 million rows, is smaller than both file1 (19 million rows) and file2 (1.9 million rows). Method2 is faster but its result is wrong, because it keeps every row of file1 rather than only the exact matches. Two questions here:

1. How do I get exact matches out of the Method2 approach? (one untested idea is sketched below)
2. Is there any faster method than these two?
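
For question 1, one untested idea: merge_asof keeps every row of file1 (left-join behaviour), and any file1 row without an exact-timestamp partner ends up with NaN in all of the file2 columns, so dropping those rows should leave only the exact matches. Here b0 stands in for any column that comes from file2 and has no NaNs of its own:

df = pd.merge_asof(file1, file2,
                   left_index=True, right_index=True,
                   direction='nearest', tolerance=pd.Timedelta('0s'))
# file1 rows with no exact match carry NaN in every file2 column;
# keep only the rows where the file2 placeholder column is populated
df = df[df['b0'].notna()]

This assumes the timestamps within each file are unique; with duplicated timestamps, merge_asof pairs rows one-to-one, so the row count can differ from the inner pd.merge.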

Solution update, based on pavel's suggestion:

%timeit df = pd.merge_asof(file1,file2,left_index=True,right_index=True,allow_exact_matches=True,direction='nearest')
10.7 s ± 758 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

df.shape
Out[26]: (19069591, 55)
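
This still returns all 19,069,591 rows, because without a tolerance merge_asof simply pairs every row of file1 with its nearest file2 row. Another idea worth timing (an untested sketch): shrink both frames to the shared timestamps first, then join:

common = file1.index.intersection(file2.index)
# Both frames restricted to the common timestamps, so the default left
# join is equivalent to an inner join. Duplicate timestamps in either
# index would break the equivalence with pd.merge's many-to-many result.
df = file1.loc[common].join(file2.loc[common])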
  • Post a brief bit of code that generates dummy test vectors of the appropriate shape, please? https://stackoverflow.com/help/minimal-reproducible-example – J_H Jun 17 '22 at 01:06
  • Did you try https://stackoverflow.com/questions/40860457/improve-pandas-merge-performance (comment section)? – teedak8s Jun 17 '22 at 01:07
  • If you need to sort a dataframe of 19 million records beforehand every time, the merge_asof is not 4 times faster in the practical sense. How long does the sorting take? – NotAName Jun 17 '22 at 01:07
  • @pavel I mentioned it in my question. – Mainland Jun 17 '22 at 01:09
  • @AkhilJain I tried that. The problem is, I don't want to use either of them as the reference; I want to merge where both files match exactly, like an `inner` join. – Mainland Jun 17 '22 at 01:11
  • @Mainland, try removing the tolerance parameter and using `allow_exact_matches=True` – NotAName Jun 17 '22 at 01:12
  • @pavel please check my updated question. – Mainland Jun 17 '22 at 01:18
  • can you use Dask in your code? – NoobVB Jun 17 '22 at 07:40

0 Answers