
I have two CSV files, which I read into the dataframes df1 and df2.

When I use the command:

df1 = pd.read_csv("path", index_col="created_at", parse_dates=["created_at"])

I get:

                      index   likes    ...      user_screen_name  sentiment
created_at                            ...
2019-02-27 05:36:29      0   94574    ...       realDonaldTrump   positive
2019-02-27 05:31:21      1   61666    ...       realDonaldTrump   negative
2019-02-26 18:08:14      2  151844    ...       realDonaldTrump   positive
2019-02-26 04:50:37      3  184597    ...       realDonaldTrump   positive
2019-02-26 04:50:36      4  181641    ...       realDonaldTrump   negative
       ...             ...    ...     ...           ...             ...

When I use the command:

df2 = pd.read_csv("path", index_col="created_at", parse_dates=["created_at"])

I get:

                     Unnamed: 0    Close     Open  Volume     Day
created_at
2019-03-01 00:47:00           0  2784.49  2784.49     NaN  STABLE
2019-03-01 00:21:00           1  2784.49  2784.49     NaN  STABLE
2019-03-01 00:20:00           2  2784.49  2784.49     NaN  STABLE
2019-03-01 00:19:00           3  2784.49  2784.49     NaN  STABLE
2019-03-01 00:18:00           4  2784.49  2784.49     NaN  STABLE
2019-03-01 00:17:00           5  2784.49  2784.49     NaN  STABLE
        ...                 ...    ...      ...       ...    ...

As you know, when you use the command:

df3=df1.join(df2)

pandas joins the two tables on the index "created_at", pairing only rows with exactly the same date and time in the two tables.
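
For example (a toy sketch with made-up values), only rows whose timestamps match exactly get paired:

import pandas as pd

# two toy frames indexed by timestamp; only 05:36:29 appears in both
a = pd.DataFrame({'likes': [94574]},
                 index=pd.to_datetime(['2019-02-27 05:36:29']))
b = pd.DataFrame({'Close': [2784.49, 2784.49]},
                 index=pd.to_datetime(['2019-02-27 05:36:29',
                                       '2019-02-27 05:38:29']))
print(a.join(b))   # pairs the identical timestamp; the 05:38:29 row of b is left out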

But I would like the join to tolerate a delay of, for example, 2 minutes.

For example, instead of:

file df1                   file df2
created_at                 created_at
2019-02-27 05:36:29        2019-02-27 05:36:29

I would like to have the two tables join like this:

file df1                   file df2
created_at                 created_at
2019-02-27 05:36:29        2019-02-27 05:38:29

It is important for my data that the df1 time comes before the df2 time; that is, the df1 event must happen before the df2 event.

  • For the two dataframes to join, the two entries in both columns have to be exactly the same. – MEdwin Mar 04 '19 at 11:53
  • @MEdwin ok thank you. But unfortunately I cannot see how to join these two tables with the delay. Do you have another idea? – stefan Mar 04 '19 at 11:57
  • Possible duplicate of [Merging two dataframes based on a date between two other dates without a common column](https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common) – gold_cy Mar 04 '19 at 11:57
  • and have you looked into this --> https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html – gold_cy Mar 04 '19 at 11:58
  • @aws_apprentice - I checked the post "Merging...". The problem is that it is important for me that the df1 time comes before the df2 time; I am interested in the reaction of df2 after the event in df1. I have already checked your link; the problem is the same as described above in my comment. I am not a fan of using the `tolerance` attribute – stefan Mar 04 '19 at 12:07
  • @aws_apprentice your link "Merging two..." is unfortunately not helping me, and I am not sure this is a duplicate question – stefan Mar 04 '19 at 12:12
  • look closely, that one asks to join on a time window, it is exactly what you are looking for – gold_cy Mar 04 '19 at 12:45
  • @aws_apprentice So if I have a file with 10'000 different dates, should I enter them manually, like in the example you provided? – stefan Mar 04 '19 at 14:17
  • SQL can easily handle this if you use outer apply and create a time window for the join. – MEdwin Mar 04 '19 at 14:51

1 Answer


For small dataframes, [Merging two dataframes based on a date between two other dates without a common column](https://stackoverflow.com/questions/43593554/merging-two-dataframes-based-on-a-date-between-two-other-dates-without-a-common) contains a nice solution. However, it uses a Cartesian product of both dataframes, so it will not scale well to larger ones.

A possible optimization is to add rounded datetime columns to the dataframes and join on those columns. As a join is much more efficient than a Cartesian product, the gain in memory and execution time should be noticeable.

What you want is (pseudo code here):

df1.created_at <= df2.created_at and df2.created_at - df1.created_at <= 2mins
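
As an aside: pd.merge_asof, mentioned in the comments, can express this condition directly when each df1 row only needs the nearest following df2 row. A minimal sketch, assuming both frames are indexed by created_at:

import pandas as pd

# merge_asof requires both frames to be sorted ascending on the key
left = df1.sort_index()
right = df2.sort_index()

# for each df1 row, take the first df2 row at or after it,
# but only if it is at most 2 minutes later
matched = pd.merge_asof(left, right,
                        left_index=True, right_index=True,
                        direction='forward',
                        tolerance=pd.Timedelta(minutes=2))

Note that merge_asof pairs each df1 row with at most one df2 row; the slot-based approach below keeps every df2 row that falls inside the window.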

I would add to both dataframes a ref column that floors created_at to its 2-minute slot (still pseudo code): ref = floor(created_at, 2 minutes), i.e. truncate the seconds and round the minute down to an even value.

If lines in both dataframes share the same ref value, their dates are less than 2 minutes apart. But this will not pick up all the expected cases, because two dates can be less than 2 minutes apart and still fall into two different slots. To cope with that, I suggest adding a ref2 column to df1, defined as ref + 2 minutes, and doing a second join on df2.ref == df1.ref2. That is enough because you want the df1 event to come before the df2 one; otherwise we would also need a third column ref3 = ref - 2 minutes.
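
To illustrate with made-up timestamps: two dates only 31 seconds apart can land in adjacent slots, which is exactly the case the ref2 join recovers:

import pandas as pd

t1 = pd.Timestamp('2019-02-27 05:37:58')   # df1 event
t2 = pd.Timestamp('2019-02-27 05:38:29')   # df2 event, 31 seconds later

print(t1.floor('2min'))   # 2019-02-27 05:36:00
print(t2.floor('2min'))   # 2019-02-27 05:38:00 -> a different slot
print(t1.floor('2min') + pd.Timedelta(minutes=2) == t2.floor('2min'))   # True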

Then, as in the referenced answer, we select the lines that actually meet the requirement and concatenate the two joined dataframes.

Pandas code could be:

import pandas as pd

# create auxiliary columns: floor each timestamp to its 2-minute slot
# (the seconds must be truncated too, hence floor('2min'))
df1['ref'] = df1.index.floor('2min')
df1['ref2'] = df1.ref + pd.Timedelta(minutes=2)

df2['ref'] = df2.index.floor('2min')
df2.index.name = 'created_at_2'
df2 = df2.reset_index().set_index('ref')

# join on ref (same slot) and keep lines where df1 precedes df2
# by at most 2 minutes
x1 = df1.join(df2, on='ref', how='inner')
x1 = x1.loc[(x1.index <= x1.created_at_2)
            & (x1.created_at_2 - x1.index <= pd.Timedelta(minutes=2))]

# join on ref2 (df2 falls in the next slot) and apply the same filter
x2 = df1.join(df2, on='ref2', how='inner')
x2 = x2.loc[(x2.index <= x2.created_at_2)
            & (x2.created_at_2 - x2.index <= pd.Timedelta(minutes=2))]

# concatenate the partial results and clean the resulting dataframe
merged = pd.concat([x1, x2]).drop(columns=['ref', 'ref2'])
merged.index.name = 'created_at'
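
A quick sanity check with made-up data (the timestamps are illustrative; run the code above on these two frames):

import pandas as pd

df1 = pd.DataFrame({'likes': [94574, 61666]},
                   index=pd.DatetimeIndex(['2019-02-27 05:36:29',
                                           '2019-02-27 05:31:21'],
                                          name='created_at'))
df2 = pd.DataFrame({'Close': [2784.49, 2785.10]},
                   index=pd.DatetimeIndex(['2019-02-27 05:38:29',
                                           '2019-02-27 05:31:00'],
                                          name='created_at'))

# after running the snippet above, merged keeps only the 05:36:29 tweet,
# paired with the 05:38:29 price (exactly 2 minutes later); the 05:31:21
# row is dropped because its only nearby price is 21 seconds *earlier*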