what is the most efficient way to synchronize two large data frames in pandas?

Question

I would like to synchronize two very long data frames, performance is key in this use case. The two data frames are indexed in chronological order (this should be exploited to be as fast as possible) using datetimes or Timestamps.

One way to synch is provided in this example:

import pandas as pd
df1=pd.DataFrame({'A':[1,2,3,4,5,6], 'B':[1,5,3,4,5,7]}, index=pd.date_range('20140101 101501', freq='u', periods=6))
df2=pd.DataFrame({'D':[10,2,30,4,5,10], 'F':[1,5,3,4,5,70]}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6))

# synch data frames
df3=df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')

My question is if this is the most efficient way to do it? I am ready to explore other solutions (e.g. using numpy or cython) if there are faster ways to solve this task.

Thanks

Note: time-stamps are not in general equally spaced (as in the example above), the method should also work in this case

Comment after reading the answers

I think there are many use cases in which neither align nor merge or join help. The point is to not use DB related semantics for aligning (which for timeseries are not so relevant in my opinion). For me aligning means map series A into B and have a way to deal with missing values (typically sample and hold method), align and join cause a not wanted effects like several timestamps repeated as a result of joining. I still do not have a perfect solution, but it seems np.searchsorted can help (it is much faster than using several calls to join / align to do what I need). I could not find a pandas way to do this up to now.

How can I map A into B so that B so that the result has all timestamps of A and B but no repetitions (except those which are already in A and B)?

Another typical use case is sample and hold synch, which can be solved in an efficient way as follows (synch A with B, i.e. take for every timestamp in A the corresponding values in B:

idx=np.searchsorted(B.index.values, A.index.values, side='right')-1
df=A.copy()
for i in B:
    df[i]=B[i].ix[idx].values

the result df contains the same index of A and the synchronized values in B.

Is there an effective way to do such things directly in pandas?

There's also the [`join`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.join.html) method but I didn't test if this is faster. Most of the pandas code is already cythonized so I don't think you'll be able to improve the time needed much. — Midnighter, Aug 10 '14 at 11:46
@Midnighter: it seems join is the fastest, %timeit df1.join(df2, how='outer') # 100 loops, best of 3: 10.1 ms per loop — Mannaggia, Aug 11 '14 at 08:28

score 5 · Accepted Answer · answered Aug 10 '14 at 13:28

If you need to synchronize then, use align, docs are here. Otherwise merge is a good option.

In [18]: N=100000

In [19]: df1=pd.DataFrame({'A':[1,2,3,4,5,6]*N, 'B':[1,5,3,4,5,7]*N}, index=pd.date_range('20140101 101501', freq='u', periods=6*N))

In [20]: df2=pd.DataFrame({'D':[10,2,30,4,5,10]*N, 'F':[1,5,3,4,5,70]*N}, index=pd.date_range('20140101 101501.000003', freq='u', periods=6*N))

In [21]: %timeit df1.merge(df2, how='outer', right_index=True, left_index=True).fillna(method='ffill')
10 loops, best of 3: 69.3 ms per loop

In [22]: %timeit df1.align(df2)
10 loops, best of 3: 36.5 ms per loop

In [24]: pd.set_option('max_rows',10)

In [25]: x, y = df1.align(df2)

In [26]: x
Out[26]: 
                             A   B   D   F
2014-01-01 10:15:01          1   1 NaN NaN
2014-01-01 10:15:01.000001   2   5 NaN NaN
2014-01-01 10:15:01.000002   3   3 NaN NaN
2014-01-01 10:15:01.000003   4   4 NaN NaN
2014-01-01 10:15:01.000004   5   5 NaN NaN
...                         ..  ..  ..  ..
2014-01-01 10:15:01.599998   5   5 NaN NaN
2014-01-01 10:15:01.599999   6   7 NaN NaN
2014-01-01 10:15:01.600000 NaN NaN NaN NaN
2014-01-01 10:15:01.600001 NaN NaN NaN NaN
2014-01-01 10:15:01.600002 NaN NaN NaN NaN

[600003 rows x 4 columns]

In [27]: y
Out[27]: 
                             A   B   D   F
2014-01-01 10:15:01        NaN NaN NaN NaN
2014-01-01 10:15:01.000001 NaN NaN NaN NaN
2014-01-01 10:15:01.000002 NaN NaN NaN NaN
2014-01-01 10:15:01.000003 NaN NaN  10   1
2014-01-01 10:15:01.000004 NaN NaN   2   5
...                         ..  ..  ..  ..
2014-01-01 10:15:01.599998 NaN NaN   2   5
2014-01-01 10:15:01.599999 NaN NaN  30   3
2014-01-01 10:15:01.600000 NaN NaN   4   4
2014-01-01 10:15:01.600001 NaN NaN   5   5
2014-01-01 10:15:01.600002 NaN NaN  10  70

[600003 rows x 4 columns]

Align and join is not what I need, I need to map timeseries A into B, no relational DB logic is required, I do not want to create new timestamps (repeated). New timestamps are created by having repeated timestamps in both A and B. — Mannaggia, Sep 02 '14 at 15:15

score 2 · Answer 2 · edited May 23 '17 at 12:34

If you wish to use the index of one of your DataFrames as pattern for synchronizing, maybe useful:

df3 = df1.iloc[df1.index.isin(df2.index),]

Note: I guess shape of df1 > shape of df2

In the previous code snippet, you get the elements in df1 and df2 but if you want to add new indexes maybe you prefer doing:

new_indexes = df1.index.diff(df2.index) # indexes of df1 and not in df2
default_values = np.zeros((new_indexes.shape[0],df2.shape[1])) 
df2 = df2.append(pd.DataFrame(default_values , index=new_indexes)).sort(axis=0)

You can see another way to synchronize in this post

Nikolai Zaitsev · Answer 3 · 2017-03-26T21:14:33.567

To my view sync of time series is a very simple procedure. Assume ts# (#=0,1,2) to be filled with

ts#[0,:] - time

ts#[1,:] - ask

ts#[2,:] - bid

ts#[3,:] - asksz

ts#[4,:] - bidsz

output is

totts[0,:] - sync time

totts[1-4,:] - ask/bid/asksz/bidsz of ts0

totts[5-8,:] - ask/bid/asksz/bidsz of ts1

totts[9-12,:] - ask/bid/asksz/bidsz of ts2

function:

def syncTS(ts0,ts1,ts2):

    ti0 = ts0[0,:]
    ti1 = ts1[0,:]
    ti2 = ts2[0,:]

    totti = np.union1d(ti0, ti1)
    totti = np.union1d(totti,ti2)

    totts = np.ndarray((13,len(totti)))

    it0=it1=it2=0
    nT0=len(ti0)-1
    nT1=len(ti1)-1
    nT2=len(ti2)-1

    for it,tim in enumerate(totti):
        if tim >= ti0[it0] and it0 < nT0:
            it0+=1

        if tim >= ti1[it1] and it1 < nT1:
            it1 += 1

        if tim >= ti2[it2] and it2 < nT2:
            it2 += 1

        totts[0, it] = tim
        for k in range(1,5):
            totts[k, it] = ts0[k, it0]
            totts[k + 4, it] = ts1[k, it1]
            totts[k + 8, it] = ts2[k, it2]

    return totts

what is the most efficient way to synchronize two large data frames in pandas?

3 Answers3