
I have two Pandas TimeSeries, x and y, which I would like to sync "as of". For every element in x, I would like to find the latest (by index) element in y that precedes it (by index value). For example, I would like to compute this new_x:

x       new_x
----    -----
13:01   13:00  
14:02   14:00

y
----
13:00
13:01
13:30
14:00

I am looking for a vectorized solution, not a Python loop. The time values are based on NumPy datetime64. The length of the y array is on the order of millions, so O(n^2) solutions are probably not practical.

Yariv
  • This seems to do what you're asking: https://gist.github.com/3686236 – NPE Jan 24 '13 at 09:25
  • This question is a duplicate of http://stackoverflow.com/questions/12322289/kdb-like-asof-join-for-timeseries-data-in-pandas – Yariv Jan 24 '13 at 09:56

2 Answers


In some circles this operation is known as the "asof" join. Here is an implementation:

def diffCols(df1, df2):
    """ Find columns in df1 not present in df2
    Return df1.columns  - df2.columns maintaining the order which the resulting
    columns appears in df1.

    Parameters:
    ----------
    df1 : pandas dataframe object
    df2 : pandas dataframe object
    Pandas already offers df1.columns - df2.columns, but unfortunately
    the original order of the resulting columns is not maintained.
    """
    return [i for i in df1.columns if i not in df2.columns]


def aj(df1, df2, overwriteColumns=True, inplace=False):
    """ KDB+ like asof join.
    Finds prevailing values of df2 asof df1's index. The resulting dataframe
    will have same number of rows as df1.

    Parameters
    ----------
    df1 : Pandas dataframe
    df2 : Pandas dataframe
    overwriteColumns : boolean, default True
         The columns of df2 will overwrite the columns of df1 if they have the same
         name unless overwriteColumns is set to False. In that case, this function
         will only join columns of df2 which are not present in df1.
    inplace : boolean, default False.
        If True, adds columns of df2 to df1. Otherwise, create a new dataframe with
        columns of both df1 and df2.

    *Assumes both df1 and df2 have datetime64 index. """
    joiner = lambda x : x.asof(df1.index)
    if not overwriteColumns:
        # Get columns of df2 not present in df1
        cols = diffCols(df2, df1)
        if len(cols) > 0:
            # .ix has been removed from pandas; .loc selects the same columns by label
            df2 = df2.loc[:, cols]
    result = df2.apply(joiner)
    if inplace:
        for i in result.columns:
            df1[i] = result[i]
        return df1
    else:
        return result

Internally, this uses pandas.Series.asof().
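
To make the matching rule concrete, here is a minimal sketch on the sample data from the question. The date is arbitrary and the variable names are illustrative assumptions. Note that Series.asof() matches the last index label less than or equal to the requested one, so 13:01 matches 13:01 rather than 13:00. If a strictly earlier match is wanted, as in the question's new_x column, one option is pd.merge_asof with allow_exact_matches=False; that function was added to pandas after this thread was written, so this part assumes a reasonably recent release.

```python
import pandas as pd

# Sample data from the question; the date is an arbitrary assumption.
y_index = pd.to_datetime(["2013-01-24 13:00", "2013-01-24 13:01",
                          "2013-01-24 13:30", "2013-01-24 14:00"])
x_index = pd.to_datetime(["2013-01-24 13:01", "2013-01-24 14:02"])

# Store each row's position as its value so the result shows which
# row of y was matched.
y = pd.Series(range(len(y_index)), index=y_index)

# Series.asof: last value whose index label is <= the requested label.
pos = y.asof(x_index).astype(int).to_numpy()
inclusive = y_index[pos]  # 13:01 and 14:00

# pd.merge_asof with allow_exact_matches=False: last row of the right
# frame whose key is strictly less than the left key. Both frames must
# already be sorted on their keys.
left = pd.DataFrame({"x": x_index})
right = pd.DataFrame({"new_x": y_index})
strict = pd.merge_asof(left, right, left_on="x", right_on="new_x",
                       allow_exact_matches=False)
print(strict)
```

The strict variant reproduces the new_x column from the question (13:00 and 14:00), while the inclusive variant returns 13:01 for the first row.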

NPE

What about using Series.searchsorted() to return the positions in y where the elements of x would be inserted? You could then subtract one from each position and use the result to index y.

In [1]: x
Out[1]:
0    1301
1    1402

In [2]: y
Out[2]:
0    1300
1    1301
2    1330
3    1400

In [3]: y[y.searchsorted(x)-1]
Out[3]:
0    1300
3    1400

Note: the above example uses int64 Series.
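
The same idea appears to carry over to datetime64 values. The following is a sketch on plain NumPy arrays, assuming y is sorted; the date and the NaT handling for elements of x that have no predecessor are additions for illustration, not part of the answer above.

```python
import numpy as np

# Sample data from the question; the date is an arbitrary assumption.
y = np.array(["2013-01-24T13:00", "2013-01-24T13:01",
              "2013-01-24T13:30", "2013-01-24T14:00"], dtype="datetime64[m]")
x = np.array(["2013-01-24T13:01", "2013-01-24T14:02"], dtype="datetime64[m]")

# side="left" returns the insertion point before any equal element, so
# subtracting one lands on the last element strictly less than x.
pos = np.searchsorted(y, x, side="left") - 1

# pos == -1 means nothing in y precedes that x. Clip before indexing so
# -1 does not wrap around to the last element, then mark those as NaT.
new_x = y[np.clip(pos, 0, None)]
new_x[pos < 0] = np.datetime64("NaT")

print(new_x)
```

Each lookup is a binary search, so the whole operation costs O(len(x) · log len(y)) and avoids a Python-level loop. Passing side="right" instead would also accept exact matches.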

Zelazny7
  • Thanks. See also: http://wesmckinney.com/blog/?p=485 for a "O(n log n) NA-friendly time series “as of” using array operations" by McKinney. But `asof()` should be faster because `searchsorted()` uses a binary search. – Yariv Jan 24 '13 at 16:53