1

I have two datasets:

df1 and df2, where I would like the columns to merge together from df1, if the datetimes are within 20 seconds of df2

df1





Connect                 Ended


4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM

3/31/2020 11:00:08 AM   3/31/2020 11:00:10 AM

4/1/2020 10:00:05 PM    4/1/2020 12:00:05 PM

df2

Start                   End


4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM

3/31/2020 11:00:10 AM   3/31/2020 11:00:14 AM

Desired Output:

df3

Match_Start1             Match_End1                     Match_Start2              Match_End2



4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM          4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM

3/31/2020 11:00:08 AM   3/31/2020 11:00:10 AM        3/31/2020 11:00:10 AM   3/31/2020 11:00:14 AM    

df4 (unmatched)

Unmatched_Start         Unmatched_end  



4/1/2020 10:00:05 PM    4/1/2020 12:00:05 PM

Dput:

df1

   :
   '  Connect   Ended\n0    4/6/2020 1:15:21 PM     4/6/2020 2:05:18 PM\n1  
   3/31/2020 11:08:08 AM  3/31/2020 11:00:10 AM\n2   4/1/2020 10:00:05 PM    4/1/2020 12:00:05    PM'

df2

  Out[117]:
  '    Start    End\n0    4/6/2020 1:15:21 PM    4/6/2020 2:05:18 PM\n1     
  3/31/2020 11:08:08 AM  3/31/2020 11:00:14 AM\n2                    NaN                    NaN'

What I am thinking:

df2 = pd.merge_asof(df1, df2, on="Connect", by = "Ended", 
tolerance=pd.Timedelta('20s'), direction='backward')

However, how do I incorporate the condition of the 20 seconds, as well as show the unmatched dataset?

Any suggestion is appreciated

Lynn
  • 4,292
  • 5
  • 21
  • 44
  • possible duplicate https://stackoverflow.com/questions/30627968/merge-pandas-dataframes-where-one-value-is-between-two-others – user3135645 Apr 15 '20 at 23:49
  • Hi, I do not believe this is a duplicate because I do not see the unmatched dataset. My question is also asking to link the two datasets based on if they are within 20 seconds of one another. – Lynn Apr 16 '20 at 00:19

1 Answers1

3

you need to read in the data and convert to datetime format - i read in the data with clipboard and parsed the dates there. Secondly, you need to sort the data by the key(in this case the key is 'Connect' for df1 and 'Start' for df2). after that pandas merge_asof should suffice. note that the merger can only occur on one key, not multiple :

sort the dataframes

df1 = df1.sort_values(['Connect','Ended'])
df2 = df2.sort_values(['Start','End'])

merge the dataframes

merger = pd.merge_asof(df1,df2,
                       left_on='Connect',
                       right_on='Start',
                       tolerance = pd.Timedelta('20s'),
                       direction='forward')

merger

     Connect                   Ended            Start              End
0   2020-03-31 11:00:08 2020-03-31 11:00:10 2020-03-31 11:00:10 2020-03-31 11:00:14
1   2020-04-01 22:00:05 2020-04-01 12:00:05        NaT             NaT
2   2020-04-06 13:15:21 2020-04-06 14:05:18 2020-04-06 13:15:21 2020-04-06 14:05:18

should be easy to pick the matched and unmatched rows :

matched = merger.dropna()
matched


         Connect              Ended                    Start         End
0   2020-03-31 11:00:08 2020-03-31 11:00:10 2020-03-31 11:00:10 2020-03-31 11:00:14
2   2020-04-06 13:15:21 2020-04-06 14:05:18 2020-04-06 13:15:21 2020-04-06 14:05:18

unmatched = merger.loc[merger.isna().any(axis=1)]
unmatched

    Connect               Ended           Start End
1   2020-04-01 22:00:05 2020-04-01 12:00:05 NaT NaT

hope it suffices... the docs have more examples to guide u thru if u get stomped

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • 1
    Ah ok thank you! I was on the right track, but got stumped along the way! Thank you!! – Lynn Apr 16 '20 at 03:20
  • 1
    yes u were on the right track. we all get stomped sometimes. time to get back to work. cheers! – sammywemmy Apr 16 '20 at 03:20
  • Hi @Sammywemmy, just wondering, the tolerance timedelta('20s') , what exactly does it mean? does it mean that these two 4/9/2020 9:40:33 AM 4/9/2020 9:52:35 AM 4/9/2020 9:40:34 AM 4/9/2020 9:52:35 AM should match since there's only a one sec difference? because when I ran the code, these timestamps were in the 'unmatched' dataset. I was thinking the 20 second tolerance would include it. I am researching the documentation as well. Thanks – Lynn Apr 16 '20 at 19:35
  • 1
    what is ur ```direction``` parameter set as ? ```forward``` ? ```nearest``` ? – sammywemmy Apr 16 '20 at 21:08
  • forward. Maybe this is why, perhaps it should be 'nearest' I am still learning this – Lynn Apr 16 '20 at 21:40