I have two dataframes, both of which contain an irregularly spaced, millisecond-resolution timestamp column. My goal is to match up the rows so that, for each matched pair, 1) the first timestamp is always smaller than or equal to the second timestamp, and 2) the matched timestamps are the closest among all pairs satisfying 1).

Is there any way to do this with pandas.merge?

Tom Bennett

4 Answers

merge() can't do this kind of join, but you can use searchsorted():

Create some random timestamps t1 and t2, both in ascending order:

import pandas as pd
import numpy as np
np.random.seed(0)

base = np.array(["2013-01-01 00:00:00"], "datetime64[ns]")

a = (np.random.rand(30)*1000000*1000).astype(np.int64)*1000000
t1 = base + a
t1.sort()

b = (np.random.rand(10)*1000000*1000).astype(np.int64)*1000000
t2 = base + b
t2.sort()

Call searchsorted() to find, for every value in t2, the index of the last value in t1 at or before it. (Note that with the default side='left', a t1 value exactly equal to a t2 value is skipped; pass side='right' if you want equal timestamps to match.)

idx = np.searchsorted(t1, t2) - 1
mask = idx >= 0

df = pd.DataFrame({"t1":t1[idx][mask], "t2":t2[mask]})

Here is the output:

                         t1                         t2
0 2013-01-02 06:49:13.287000 2013-01-03 16:29:15.612000
1 2013-01-05 16:33:07.211000 2013-01-05 21:42:30.332000
2 2013-01-07 04:47:24.561000 2013-01-07 04:53:53.948000
3 2013-01-07 14:26:03.376000 2013-01-07 17:01:35.722000
4 2013-01-07 14:26:03.376000 2013-01-07 18:22:13.996000
5 2013-01-07 14:26:03.376000 2013-01-07 18:33:55.497000
6 2013-01-08 02:24:54.113000 2013-01-08 12:23:40.299000
7 2013-01-08 21:39:49.366000 2013-01-09 14:03:53.689000
8 2013-01-11 08:06:36.638000 2013-01-11 13:09:08.078000

To visualize the result:

import pylab as pl
pl.figure(figsize=(18, 4))
pl.vlines(pd.Series(t1), 0, 1, colors="g", lw=1)
pl.vlines(df.t1, 0.3, 0.7, colors="r", lw=2)
pl.vlines(df.t2, 0.3, 0.7, colors="b", lw=2)
pl.margins(0.02)

Output:

[plot of the timestamps as vertical lines]

The green lines are t1, blue lines are t2, red lines are selected from t1 for every t2.

HYRY

Pandas now has the function merge_asof, which does exactly what the accepted answer describes.
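A minimal sketch of merge_asof on hypothetical toy frames (the column names and values here are made up for illustration). With the default direction='backward', each row of the left frame is matched to the last row of the right frame whose key is at or before it, which is exactly the "closest earlier-or-equal timestamp" matching from the question:

```python
import pandas as pd

# Hypothetical sample frames; both must be sorted on the merge key.
left = pd.DataFrame({"time": pd.to_datetime(["2013-01-01 00:00:00.500",
                                             "2013-01-01 00:00:02.300"]),
                     "left_val": [1, 2]})
right = pd.DataFrame({"time": pd.to_datetime(["2013-01-01 00:00:00.100",
                                              "2013-01-01 00:00:01.900",
                                              "2013-01-01 00:00:02.300"]),
                      "right_val": ["a", "b", "c"]})

# For each left row, take the last right row with right.time <= left.time.
merged = pd.merge_asof(left, right, on="time", direction="backward")
```

Exact matches are included by default (allow_exact_matches=True), so the second left row pairs with the right row at the identical timestamp.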

cdarlint
    Here's an answer that uses this method for reference. https://stackoverflow.com/questions/34880539/pandas-merging-based-on-a-timestamp-which-do-not-match-exactly/51388559#51388559 – jcp Oct 08 '19 at 14:47
I used a different way than HYRY:

  1. do a regular merge with outer join (how='outer');
  2. sort it by date;
  3. use fillna(method='pad') to fill just the columns you need, padding each gap with the value from the previous filled row;
  4. drop all the rows you don't need from the outer join.

All this can be written in a few lines:

df = pd.merge(df0, df1, on='Date', how='outer')
df = df.sort_values('Date')  # df.sort() is deprecated; use sort_values()
headertofill = list(df1.columns.values)
df[headertofill] = df[headertofill].fillna(method='pad')
# var_from_df0_only: any column that exists in df0 but not in df1
df = df[df[var_from_df0_only].notnull()]
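The steps above can be sketched end to end on toy data (df0, df1, and the column names here are stand-ins, not from the original answer):

```python
import pandas as pd

# Toy frames standing in for df0 and df1.
df0 = pd.DataFrame({"Date": pd.to_datetime(["2013-01-02", "2013-01-05"]),
                    "x": [10, 20]})
df1 = pd.DataFrame({"Date": pd.to_datetime(["2013-01-01", "2013-01-04"]),
                    "y": ["a", "b"]})

# 1) outer merge, 2) sort by date, 3) forward-fill df1's columns,
# 4) drop the extra rows that came only from df1.
df = pd.merge(df0, df1, on="Date", how="outer")
df = df.sort_values("Date")
headertofill = [c for c in df1.columns if c != "Date"]
df[headertofill] = df[headertofill].ffill()  # 'pad' == forward fill
df = df[df["x"].notnull()]                   # "x" exists only in df0
```

After the filter, each surviving row carries the most recent earlier df1 values alongside its df0 values.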
Yaron
    You haven't defined what var_from_df0_only is – Pigeon Apr 16 '16 at 09:56
  • hi Pigeon, most of the time you want to keep the original data frame (df0) and enrich it with a different one (df1). Using the outer join, you'll have some extra rows from df1, so in order to drop those just use one column that exists in df0 but not in df1 ("var_from_df0_only"), which after the outer join has null values for the extra rows. – Yaron Apr 21 '16 at 03:10
  • Pandas sort() has been deprecated. We have to use sort_values() now instead – Noise in the street Aug 23 '17 at 22:00
Here is a simpler and more general method.

# data and signal are what we want to merge
keys = ['channel', 'timestamp']               # Could be simply ['timestamp']
index = data.set_index(keys).index            # Make index from the columns to merge on
padded = signal.reindex(index, method='pad')  # Key step -- reindex with forward filling
joined = data.join(padded, on=keys)           # Join back to data if needed
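A self-contained sketch of this reindex-with-pad approach, using a single 'timestamp' key and made-up stand-ins for data and signal (signal must be sorted by its timestamp index for method='pad' to work):

```python
import pandas as pd

# Toy stand-ins for `data` and `signal`.
data = pd.DataFrame({"timestamp": pd.to_datetime(["2013-01-02", "2013-01-05"]),
                     "reading": [1, 2]})
signal = pd.DataFrame({"value": ["a", "b"]},
                      index=pd.to_datetime(["2013-01-01", "2013-01-04"]))
signal.index.name = "timestamp"

keys = ["timestamp"]
index = data.set_index(keys).index            # target index built from data's key columns
padded = signal.reindex(index, method="pad")  # last signal row at or before each timestamp
joined = data.join(padded, on=keys)           # attach the padded signal values to data
```

Each row of data picks up the most recent signal value at or before its timestamp.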
George Sovetov