
I have two CSV files, which I read into the dataframes df1 and df2.

When I use the command:

df1 = pd.read_csv("path", index_col="created_at", parse_dates=["created_at"])

I get:

                      index   likes    ...      user_screen_name  sentiment
created_at                            ...
2019-02-27 05:36:29      0   94574    ...       realDonaldTrump   positive
2019-02-27 05:31:21      1   61666    ...       realDonaldTrump   negative
2019-02-26 18:08:14      2  151844    ...       realDonaldTrump   positive
2019-02-26 04:50:37      3  184597    ...       realDonaldTrump   positive
2019-02-26 04:50:36      4  181641    ...       realDonaldTrump   negative
       ...             ...    ...     ...           ...             ...

When I use the command:

df2 = pd.read_csv("path", index_col="created_at", parse_dates=["created_at"])

I get:

                     Unnamed: 0    Close     Open  Volume     Day
created_at
2019-03-01 00:47:00           0  2784.49  2784.49     NaN  STABLE
2019-03-01 00:21:00           1  2784.49  2784.49     NaN  STABLE
2019-03-01 00:20:00           2  2784.49  2784.49     NaN  STABLE
2019-03-01 00:19:00           3  2784.49  2784.49     NaN  STABLE
2019-03-01 00:18:00           4  2784.49  2784.49     NaN  STABLE
2019-03-01 00:17:00           5  2784.49  2784.49     NaN  STABLE
        ...                 ...    ...      ...       ...    ...

As you know, when you use the command:

df3=df1.join(df2)

pandas joins the two tables on the index "created_at", pairing only rows with exactly the same date and time in the two tables.
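
For example (a toy sketch with made-up values), only rows whose timestamps match exactly get paired:

import pandas as pd

# two toy frames indexed by timestamp; only 05:36:29 appears in both
a = pd.DataFrame({'likes': [94574]},
                 index=pd.to_datetime(['2019-02-27 05:36:29']))
b = pd.DataFrame({'Close': [2784.49, 2784.49]},
                 index=pd.to_datetime(['2019-02-27 05:36:29',
                                       '2019-02-27 05:38:29']))
print(a.join(b))   # pairs the identical timestamp; the 05:38:29 row of b is left out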

But I would like the join to tolerate a delay of, for example, 2 minutes.

For example, instead of:

file df1                   file df2
created_at                 created_at
2019-02-27 05:36:29        2019-02-27 05:36:29

I would like to have the two tables join like this:

file df1                   file df2
created_at                 created_at
2019-02-27 05:36:29        2019-02-27 05:38:29

It is important for my data that the df1 time comes before the df2 time; that is, the df1 event must happen before the df2 event.

  • For the two dataframes to join, the two entries in both columns have to be exactly the same. – MEdwin Mar 04 '19 at 11:53
  • @MEdwin ok thank you. But unfortunately I cannot see how to join these two tables with the delay. Do you have another idea? – stefan Mar 04 '19 at 11:57
  • Possible duplicate of [Merging two dataframes based on a date between two other dates without a common column](https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common) – gold_cy Mar 04 '19 at 11:57
  • and have you looked into this --> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html – gold_cy Mar 04 '19 at 11:58
  • @aws_apprentice - I checked the post "Merging...". The problem is that it is important for me that the df1 time comes before the df2 time; I am interested in the reaction of df2 after the event in df1. I have already checked your link; the problem is the same as described above in my comment. I am not a fan of using the `tolerance` attribute – stefan Mar 04 '19 at 12:07
  • @aws_apprentice your link "Merging two..." is unfortunately not helping me, and I am not sure this is a duplicate question – stefan Mar 04 '19 at 12:12
  • look closely, that one asks to join on a time window, it is exactly what you are looking for – gold_cy Mar 04 '19 at 12:45
  • @aws_apprentice So if I have a file with 10'000 different dates, should I enter them manually, like in the example you provided? – stefan Mar 04 '19 at 14:17
  • SQL can easily handle this if you use outer apply and create a time window for the join. – MEdwin Mar 04 '19 at 14:51

1 Answer


For small dataframes, [Merging two dataframes based on a date between two other dates without a common column](https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common) contains a nice solution. However, it uses a Cartesian product of both dataframes, so it will not scale well to larger ones.

A possible optimization is to add rounded datetime columns to the dataframes and join on those columns. As a join is much more efficient than a Cartesian product, the gain in memory and execution time should be noticeable.

What you want is (pseudo code here):

df1.created_at <= df2.created_at and df2.created_at - df1.created_at <= 2mins
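
As an aside: pd.merge_asof, mentioned in the comments, can express this condition directly when each df1 row only needs the nearest following df2 row. A minimal sketch, assuming both frames are indexed by created_at:

import pandas as pd

# merge_asof requires both frames to be sorted ascending on the key
left = df1.sort_index()
right = df2.sort_index()

# for each df1 row, take the first df2 row at or after it,
# but only if it is at most 2 minutes later
matched = pd.merge_asof(left, right,
                        left_index=True, right_index=True,
                        direction='forward',
                        tolerance=pd.Timedelta(minutes=2))

Note that merge_asof pairs each df1 row with at most one df2 row; the slot-based approach below keeps every df2 row that falls inside the window.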

I would add to both dataframes a ref column that floors created_at to its 2-minute slot (still pseudo code): ref = floor(created_at, 2 minutes), i.e. truncate the seconds and round the minute down to an even value.

If lines in both dataframes share the same ref value, their dates are less than 2 minutes apart. But this will not pick up all the expected cases, because two dates can be less than 2 minutes apart and still fall into two different slots. To cope with that, I suggest adding a ref2 column to df1, defined as ref + 2 minutes, and doing a second join on df2.ref == df1.ref2. That is enough because you want the df1 event to come before the df2 one; otherwise we would also need a third column ref3 = ref - 2 minutes.
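
To illustrate with made-up timestamps: two dates only 31 seconds apart can land in adjacent slots, which is exactly the case the ref2 join recovers:

import pandas as pd

t1 = pd.Timestamp('2019-02-27 05:37:58')   # df1 event
t2 = pd.Timestamp('2019-02-27 05:38:29')   # df2 event, 31 seconds later

print(t1.floor('2min'))   # 2019-02-27 05:36:00
print(t2.floor('2min'))   # 2019-02-27 05:38:00 -> a different slot
print(t1.floor('2min') + pd.Timedelta(minutes=2) == t2.floor('2min'))   # True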

Then, as in the referenced answer, we select the lines that actually meet the requirement and concatenate the two joined dataframes.

Pandas code could be:

import pandas as pd

# create auxiliary columns: floor each timestamp to its 2-minute slot
# (the seconds must be truncated too, hence floor('2min'))
df1['ref'] = df1.index.floor('2min')
df1['ref2'] = df1.ref + pd.Timedelta(minutes=2)

df2['ref'] = df2.index.floor('2min')
df2.index.name = 'created_at_2'
df2 = df2.reset_index().set_index('ref')

# join on ref (same slot) and keep lines where df1 precedes df2
# by at most 2 minutes
x1 = df1.join(df2, on='ref', how='inner')
x1 = x1.loc[(x1.index <= x1.created_at_2)
            & (x1.created_at_2 - x1.index <= pd.Timedelta(minutes=2))]

# join on ref2 (df2 falls in the next slot) and apply the same filter
x2 = df1.join(df2, on='ref2', how='inner')
x2 = x2.loc[(x2.index <= x2.created_at_2)
            & (x2.created_at_2 - x2.index <= pd.Timedelta(minutes=2))]

# concatenate the partial results and clean the resulting dataframe
merged = pd.concat([x1, x2]).drop(columns=['ref', 'ref2'])
merged.index.name = 'created_at'
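
A quick sanity check with made-up data (the timestamps are illustrative; run the code above on these two frames):

import pandas as pd

df1 = pd.DataFrame({'likes': [94574, 61666]},
                   index=pd.DatetimeIndex(['2019-02-27 05:36:29',
                                           '2019-02-27 05:31:21'],
                                          name='created_at'))
df2 = pd.DataFrame({'Close': [2784.49, 2785.10]},
                   index=pd.DatetimeIndex(['2019-02-27 05:38:29',
                                           '2019-02-27 05:31:00'],
                                          name='created_at'))

# after running the snippet above, merged keeps only the 05:36:29 tweet,
# paired with the 05:38:29 price (exactly 2 minutes later); the 05:31:21
# row is dropped because its only nearby price is 21 seconds *earlier*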