28

What methods are available to merge columns which have timestamps that do not exactly match?

DF1:

    date        start_time           employee_id  session_id
    01/01/2016  01/01/2016 06:03:13  7261824      871631182

DF2:

    date        start_time           employee_id  session_id
    01/01/2016  01/01/2016 06:03:37  7261824      871631182

I could join on ['date', 'employee_id', 'session_id'], but sometimes the same employee will have multiple identical sessions on the same date, which causes duplicates. I could drop the rows where this takes place, but then I would lose valid sessions.

Is there an efficient way to join if the timestamp of DF1 is less than 5 minutes from the timestamp of DF2, and the session_id and employee_id also match? If there is a matching record, the DF2 timestamp will always be slightly later than the DF1 timestamp, because an event is triggered at some future point.

['employee_id', 'session_id', 'timestamp<5minutes']

Edit - I assumed someone would have run into this issue before.

I was thinking of doing this:

  1. Take my timestamp on each dataframe
  2. Create a column which is the timestamp + 5 minutes (rounded)
  3. Create a column which is the timestamp - 5 minutes (rounded)
  4. Create a 10 minute interval string to join the files on

    from datetime import timedelta

    df1['low_time'] = df1['start_time'] - timedelta(minutes=5)
    df1['high_time'] = df1['start_time'] + timedelta(minutes=5)
    df1['interval_string'] = df1['low_time'].astype(str) + '-' + df1['high_time'].astype(str)
    

Does someone know how to round those 5 minute intervals to the nearest 5 minute mark?

02:59:37 - 5 min = 02:55:00

02:59:37 + 5 min = 03:05:00

interval_string = '02:55:00-03:05:00'

pd.merge(df1, df2, how='left', on=['employee_id', 'session_id', 'date', 'interval_string'])

Does anyone know how to round the time like that? This seems like it could work: you still match on the date, employee, and session, and then you look for times which fall within the same 10-minute interval or range.
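A minimal sketch of that rounding, assuming `start_time` is already a parsed datetime column (pandas' `Series.dt.round` snaps to the nearest mark of a given frequency):

    from datetime import timedelta

    import pandas as pd

    df1 = pd.DataFrame({'start_time': pd.to_datetime(['2016-01-01 02:59:37'])})

    # shift by +/- 5 minutes, then snap each bound to the nearest 5-minute mark
    df1['low_time'] = (df1['start_time'] - timedelta(minutes=5)).dt.round('5min')
    df1['high_time'] = (df1['start_time'] + timedelta(minutes=5)).dt.round('5min')

    # build the 10-minute interval string to join on, e.g. '02:55:00-03:05:00'
    df1['interval_string'] = (df1['low_time'].dt.strftime('%H:%M:%S') + '-'
                              + df1['high_time'].dt.strftime('%H:%M:%S'))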

trench
  • Interesting problem. The naive solution is to merge on the timestamp rounded to the nearest 5 minutes, but this will leave some sessions as separate rows if they happen to be on different sides of the 5 minute mark. You could apply the procedure iteratively with a randomized offset, up to a certain number of iterations, which would yield better results. The most robust solution would be a clustering algorithm, but this is more difficult to implement. – Igor Raush Jan 19 '16 at 18:51
  • [This](http://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas?rq=1) could provide some inspiration. – Igor Raush Jan 19 '16 at 18:53
  • Ah so I might have to play with another library. Yeah I couldn't get my head around how the time might round on the opposite end of a cutoff. – trench Jan 20 '16 at 01:50
  • I added an approach I was considering.. not sure how to execute it exactly or if I missed something though – trench Jan 20 '16 at 19:15
  • Ideally what you'd like is to use a SQL-style `where` clause on your `join` operation that specifies one of the dates using `between` and two bounds based on the other date. If it's at all feasible to do this in the database directly, or use an in-memory database like SQLite, I'd recommend it; a sketch of that route follows these comments. The hacks you'll need to do it in pandas will be bad, and if you do it the database way, you can still pull the result out to pandas afterwards for interactive processing or whatever. – ely Jan 20 '16 at 19:22
  • Hmm. I can look into it. The data is coming from some csv files, not a database. I'm not too familiar with SQL but your comment might point me in the right direction to learn. – trench Jan 20 '16 at 19:25
  • @Lance is it guaranteed that the two data frames contain truly unique sessions *separately*? i.e. is the deduplication only applicable when you merge them? Or is it possible that there will be two rows for the "same" session with slightly different timestamps in the same data frame? – Igor Raush Jan 20 '16 at 19:37
  • One file contains call records and the other file contains recordings information. The recording timestamp is a bit after the call initiation timestamp, however, the data truly does match. The recording is for a specific call interaction. The session ID can be duplicated though (if the customer is transferred etc, the session ID does not change), which is why I am trying to match the employee and the session ID along with the timestamp. – trench Jan 20 '16 at 20:37
  • To more precisely answer your question - there can be duplicate session IDs within a small timeframe (a customer's call reaches you but is dropped back to the IVR system, and then they connect to another employee). There is virtually no chance of the session_id and employee_id having a duplicate with a slightly different timestamp, though. – trench Jan 20 '16 at 20:39
  • Sorry, still don't understand. _Within a single dataframe_, do you need to perform session deduplication (accounting for small differences in timestamps)? – Igor Raush Jan 20 '16 at 20:42
  • Another way to put it, is there a guarantee that there will be exactly zero or one duplicates in DF2 for a single session in DF1? – Igor Raush Jan 20 '16 at 20:47
  • For a single dataframe, timestamp + session id would be unique. Each row is a separate event. I can actually do a lot of the matches with the date + session id + employee name, but sometimes the same session id reaches the same employee multiple times in one day so I want to make it a bit more specific. – trench Jan 20 '16 at 20:50
  • Yes, there would be 0 duplicates in DF2 for a session in DF1. Each event in DF2 corresponds to exactly one row in DF1. (It is just that the timestamps do not exactly match up and only around 50% of DF1 has a corresponding match in DF2). DF2 is a subset of DF1, technically. – trench Jan 20 '16 at 20:54
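Following up on ely's suggestion, a rough sketch of the in-memory SQLite route. This is a sketch under assumptions: `start_time` is a full datetime, which `to_sql` stores as ISO-formatted text, so string `BETWEEN` comparisons and SQLite's `datetime(..., '+5 minutes')` behave chronologically; the `recording_start_time` alias is made up for illustration.

    import sqlite3

    import pandas as pd

    con = sqlite3.connect(':memory:')
    df1.to_sql('df1', con, index=False)
    df2.to_sql('df2', con, index=False)

    # SQL-style join with a BETWEEN clause on the timestamps
    merged = pd.read_sql_query('''
        SELECT df1.*, df2.start_time AS recording_start_time
        FROM df1
        LEFT JOIN df2
          ON  df1.employee_id = df2.employee_id
          AND df1.session_id  = df2.session_id
          AND df2.start_time BETWEEN df1.start_time
                                 AND datetime(df1.start_time, '+5 minutes')
    ''', con)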

3 Answers

42

I would try using this method in pandas:

pandas.merge_asof()

The parameters of interest for you would be direction, tolerance, left_on, and right_on.

Building off @Igor's answer:

import pandas as pd
from pandas import read_csv
from io import StringIO

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)



# parse_dates has already combined date + start_time into a single
# datetime column, so this conversion is effectively a no-op safety check
df1['date_start_time'] = pd.to_datetime(df1['date_start_time'])
df2['date_start_time'] = pd.to_datetime(df2['date_start_time'])

# move the timestamp into the index while keeping the column,
# so you can validate the merging logic afterwards
df1.index = df1['date_start_time']
df2.index = df2['date_start_time']

# the magic happens below; check the direction and tolerance arguments
tol = pd.Timedelta('5 minute')
pd.merge_asof(left=df1, right=df2, right_index=True, left_index=True, direction='nearest', tolerance=tol)

output

date_start_time      date_start_time_x    employee_id_x  session_id_x  date_start_time_y    employee_id_y  session_id_y
2016-01-01 02:03:00  2016-01-01 02:03:00  7261824        871631182     2016-01-01 02:03:00  7261824.0      871631182.0
2016-01-01 06:03:00  2016-01-01 06:03:00  7261824        871631183     2016-01-01 06:05:00  7261824.0      871631183.0
2016-01-01 11:01:00  2016-01-01 11:01:00  7261824        871631184     2016-01-01 11:04:00  7261824.0      871631184.0
2016-01-01 14:01:00  2016-01-01 14:01:00  7261824        871631185     NaT                  NaN            NaN
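As noted in the comments below, this matches only on the nearest timestamp; to also require employee_id and session_id to match, merge_asof accepts a by= parameter. A sketch using the same frames, before the index assignment (both sides must be sorted on the "on" key):

df1 = df1.sort_values('date_start_time')
df2 = df2.sort_values('date_start_time')

pd.merge_asof(left=df1, right=df2,
              on='date_start_time',
              by=['employee_id', 'session_id'],
              direction='nearest',
              tolerance=pd.Timedelta('5 minute'))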
jcp
  • Quite cool. This is several years after the fact so the code which I am using this on is extremely legacy and not something I plan on updating (for fear of breaking stuff I haven't looked at in so long), but it is a nice feature and I will keep it in mind for other problems. – trench Jul 18 '18 at 12:01
  • How does this join on ('employee_id', 'session_id') first, not only on closest date_start_time? I think you need a by= parameter in the merge_asof – Hugo Zaragoza Nov 24 '21 at 06:16
17

Consider the following mini-version of your problem:

from io import StringIO
from pandas import read_csv, to_datetime

# how close do sessions have to be to be considered equal? (in minutes)
threshold = 5

# datetime column (combination of date + start_time)
dtc = [['date', 'start_time']]

# index column (above combination)
ixc = 'date_start_time'

df1 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:03:00,7261824,871631183
01/01/2016,11:01:00,7261824,871631184
01/01/2016,14:01:00,7261824,871631185
'''), parse_dates=dtc)

df2 = read_csv(StringIO(u'''
date,start_time,employee_id,session_id
01/01/2016,02:03:00,7261824,871631182
01/01/2016,06:05:00,7261824,871631183
01/01/2016,11:04:00,7261824,871631184
01/01/2016,14:10:00,7261824,871631185
'''), parse_dates=dtc)

which gives

>>> df1
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:03:00      7261824   871631183
2 2016-01-01 11:01:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
>>> df2
      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:10:00      7261824   871631185

You would like to treat df2[0:3] as duplicates of df1[0:3] when merging (since they are respectively less than 5 minutes apart), but treat df1[3] and df2[3] as separate sessions.

Solution 1: Interval Matching

This is essentially what you are suggesting in your edit. You want to map timestamps in both tables to a 10-minute interval centered on the timestamp rounded to the nearest 5 minutes.

Each interval can be represented uniquely by its midpoint, so you can merge the data frames on the timestamp rounded to the nearest 5 minutes. For example:

import numpy as np

# half-threshold in nanoseconds
threshold_ns = threshold * 60 * 1e9

# compute "interval" to which each session belongs
df1['interval'] = to_datetime(np.round(df1.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)
df2['interval'] = to_datetime(np.round(df2.date_start_time.astype(np.int64) / threshold_ns) * threshold_ns)

# join
cols = ['interval', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

which prints

             interval  employee_id  session_id
0 2016-01-01 02:05:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:00:00      7261824   871631184
3 2016-01-01 14:00:00      7261824   871631185
4 2016-01-01 11:05:00      7261824   871631184
5 2016-01-01 14:10:00      7261824   871631185

Note that this is not totally correct. The sessions df1[2] and df2[2] are not treated as duplicates although they are only 3 minutes apart. This is because they were on different sides of the interval boundary.

Solution 2: One-to-one matching

Here is another approach, which relies on the condition that each session in df1 has either zero or one duplicate in df2.

We replace timestamps in df1 with the closest timestamp in df2 which matches on employee_id and session_id and is less than 5 minutes away.

from datetime import timedelta

# get closest match from "df2" to row from "df1" (as long as it's below the threshold)
def closest(row):
    matches = df2.loc[(df2.employee_id == row.employee_id) &
                      (df2.session_id == row.session_id)]

    deltas = matches.date_start_time - row.date_start_time
    deltas = deltas.loc[deltas <= timedelta(minutes=threshold)]

    try:
        return matches.loc[deltas.idxmin()]
    except ValueError:  # no items
        return row

# replace timestamps in "df1" with closest timestamps in "df2"
df1 = df1.apply(closest, axis=1)

# join
cols = ['date_start_time', 'employee_id', 'session_id']
print(df1.merge(df2, on=cols, how='outer')[cols])

which prints

      date_start_time  employee_id  session_id
0 2016-01-01 02:03:00      7261824   871631182
1 2016-01-01 06:05:00      7261824   871631183
2 2016-01-01 11:04:00      7261824   871631184
3 2016-01-01 14:01:00      7261824   871631185
4 2016-01-01 14:10:00      7261824   871631185

This approach is significantly slower, since you have to search through the entirety of df2 for each row in df1. What I have written can probably be optimized further, but this will still take a long time on large datasets.
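One possible optimization (a sketch, not part of the original approach, using df1 and df2 as defined at the top of this answer): merge on the key columns first to enumerate candidate pairs, then keep the closest pair within the threshold. This trades memory for speed, since it materializes every candidate pair per key.

import pandas as pd

# candidate pairs: every df1/df2 row combination sharing employee_id and session_id
cand = df1.reset_index().merge(df2, on=['employee_id', 'session_id'],
                               suffixes=('_1', '_2'))
cand['delta'] = (cand['date_start_time_2'] - cand['date_start_time_1']).abs()
cand = cand[cand['delta'] <= pd.Timedelta(minutes=threshold)]

# for each original df1 row ("index"), keep only the closest candidate
best = cand.loc[cand.groupby('index')['delta'].idxmin()]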

Igor Raush
  • Looks like a nice start for me to work with. Regarding your first solution, can we include a plus and minus interval range to prevent the issue of events being on the wrong side of an interval? The interval would be a string like in the example I typed out. Not sure if the logic is 100% sound but I had it work on test data in excel. – trench Jan 21 '16 at 02:02
  • I think yours would suffer from the same issue. Consider that you are mapping a continuous time range into discrete intervals. This means that you can always think of a pair of timestamps which will be close enough on the continuous range, but fall into different intervals. I'm not sure if my approach is completely equivalent to yours (although I think it might be), but the general idea holds. – Igor Raush Jan 21 '16 at 02:12
  • Haha what a headache. Thanks though I'll test this later and will let you know. It should improve my matching at least – trench Jan 21 '16 at 02:15
  • Just realized that the interval binning solution can be made significantly nicer using [DatetimeIndex.snap](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DatetimeIndex.snap.html). Didn't know about that method until now. – Igor Raush Jan 21 '16 at 02:17
  • That seems interesting but the documentation is completely unhelpful. I haven't worked with DatetimeIndex before so I am not quite certain how to proceed. `df1 = df1.set_index(pd.DatetimeIndex(df1['call_start'], drop=False))` seems to have created the index, but I tried a few attempts at adding .snap and was unsuccessful. I couldn't find a nice example to follow online when I googled. – trench Jan 21 '16 at 17:11
  • I gave it 10 minutes and wasn't able to get it to work either, sorry. – Igor Raush Jan 21 '16 at 17:26
1

I would suggest using the built-in pandas `Series.dt.round` function to round the times in both dataframes to a common grid, for example every 5 minutes, so the times always look like 01:00:00, 01:05:00, and so on. That way, both dataframes will have comparable time keys to merge on.

Please see the documentation and examples for pandas.Series.dt.round.
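A minimal sketch of that suggestion, borrowing the column names from the examples above (note it shares the interval-boundary caveat discussed under Solution 1):

# snap both timestamp columns to the nearest 5-minute mark, then merge on the result
df1['rounded_time'] = df1['date_start_time'].dt.round('5min')
df2['rounded_time'] = df2['date_start_time'].dt.round('5min')

merged = df1.merge(df2, on=['rounded_time', 'employee_id', 'session_id'], how='left')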