I would like to synchronize two very long data frames, performance is key in this use case. The two data frames are indexed in chronological order (this should be exploited to be as fast as possible) using datetimes or Timestamps.
One way to synch is provided in this example:
import pandas as pd
df1=pd.DataFrame({'A':[1,2,3,4,5,6], 'B':[1,5,3,4,5,7]}, index=pd.date_range('20140101 101501', freq='u', periods=6))
df2=pd.DataFrame({'D':[10,2,30,4,5,10], 'F':[1,5,3,4,5,70]}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6))
# synch data frames
df3=df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')
My question is if this is the most efficient way to do it? I am ready to explore other solutions (e.g. using numpy or cython) if there are faster ways to solve this task.
Thanks
Note: time-stamps are not in general equally spaced (as in the example above), the method should also work in this case
Comment after reading the answers
I think there are many use cases in which neither align nor merge or join help. The point is to not use DB related semantics for aligning (which for timeseries are not so relevant in my opinion). For me aligning means map series A into B and have a way to deal with missing values (typically sample and hold method), align and join cause a not wanted effects like several timestamps repeated as a result of joining. I still do not have a perfect solution, but it seems np.searchsorted can help (it is much faster than using several calls to join / align to do what I need). I could not find a pandas way to do this up to now.
How can I map A into B so that B so that the result has all timestamps of A and B but no repetitions (except those which are already in A and B)?
Another typical use case is sample and hold synch, which can be solved in an efficient way as follows (synch A with B, i.e. take for every timestamp in A the corresponding values in B:
idx=np.searchsorted(B.index.values, A.index.values, side='right')-1
df=A.copy()
for i in B:
df[i]=B[i].ix[idx].values
the result df contains the same index of A and the synchronized values in B.
Is there an effective way to do such things directly in pandas?