3

I am trying to rewrite some R code in Python and cannot get past one particular bit of code. I've found the foverlaps function in R to be very useful when performing a time-based join, but haven't found anything that works as well in Python3.

What I am doing is joining two data tables where the time in one table falls between the start_time and end_time in another table. The periodicity of the two tables is not the same - table_A occurs on a per second basis and can have multiple entries at each interval, while table_B will have one entry every 0 - 10 minutes at irregular intervals.

This question is very similar to what I am asking: Merge pandas dataframes where one value is between two others

The following code provides my desired output in R:

# Add dummy column to use with foverlaps
table_A <- table_A[, dummy := time]

# Set keys
setkey(table_B, x, y, start_time, end_time)
setkey(table_A, x, y, time, dummy)

# Join tables based on time
joined <- foverlaps(table_A, table_B, type = "within", by.x=c("x", "y", "time", "dummy"), by.y=c("x", "y", "start_time", "end_time"), nomatch=0L)[, dummy := NULL]


> head(table_A)
   time                         x       y     dummy
1: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27 
2: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
3: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
4: 2016-07-11 11:52:27          4077    1     2016-07-11 11:52:27
5: 2016-07-11 11:52:32          4077    1     2016-07-11 11:52:32
6: 2016-07-11 11:52:32          4077    1     2016-07-11 11:52:32


> head(table_B)
                x       y   start_time              end_time
1:              6183    1   2016-07-11 12:00:45     2016-07-11 12:00:56 
2:              6183    1   2016-07-11 12:01:20     2016-07-11 12:01:20   
3:              6183    1   2016-07-11 12:01:40     2016-07-11 12:03:26  
4:              6183    1   2016-07-11 12:04:20     2016-07-11 12:04:40  
5:              6183    1   2016-07-11 12:04:55     2016-07-11 12:04:57  
6:              6183    1   2016-07-11 12:05:40     2016-07-11 12:05:51  

So, any row in table_A where time falls between start_time and end_time will be joined with the corresponding row in table_B, giving an output such as below. I've tried many different things in Python, but haven't found the solution yet.

One thing that may not be apparent from the example data is that multiple x and y values occur at times within the same start_time and end_times.

> head(joined)
  y      x      start_time              end_time                time 
1 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
2 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
3 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
4 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
5 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:46    
6 1      4077   2016-07-11 12:00:45     2016-07-11 12:00:56     2016-07-11 12:00:55 
Community
  • 1
  • 1
Jeff
  • 688
  • 1
  • 13
  • 30
  • Yes, I am using pandas. I thought maybe an asof or merged join would work but no luck. I was hoping to get some ideas without trying to fix one of my attempts that may be way off base. – Jeff Oct 25 '16 at 18:36
  • 3
    Can you provide a sample of what `table_A`, `table_B`, and `joined` (assuming the samples you provide for `A` and `B`) looks like. With that, I can help you... if someone else doesn't beat me to it – piRSquared Oct 25 '16 at 18:46
  • A simple way to mimic this is to do use [`numpy.searchsorted`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.searchsorted.html), and sort your `time` data into an array of `start_time` and `end_time`. Depending on the structure of your data `start_time` might not even be necessary to use. – Jan Christoph Terasa Oct 26 '16 at 01:04
  • I've added some example data, although it is hard to fully portray the dataset (usually consists of 1M+ rows and many columns). – Jeff Oct 26 '16 at 05:14

1 Answers1

3

Consider a straightforward merge with subset using pandas.Series.between(). Merge joins all combinations of the join columns and the subset keeps rows that align to time intervals.

df = pd.merge(table_A, table_B, on=['x', 'y'])                   
df = df[df['time'].between(df['start_time'], df['end_time'], inclusive=True)]

However, one important item is your dates should be casted as datetime type. Currently, your post shows string dates which affects above .between(). Below assumes US dates with month first as MM/DD/YYYY. Either you can convert types during file read in:

dateparse = lambda x: pd.datetime.strptime(x, '%m/%d/%Y %H:%M:%S')

table_A = pd.read_csv('data.csv', parse_dates=[0], date_parser=dateparse, dayfirst=False)

table_B = pd.read_csv('data.csv', parse_dates=[0,1], date_parser=dateparse, dayfirst=False)

Or after read in:

table_A['time'] = pd.to_datetime(table_A['time'], format='%m/%d/%Y %H:%M:%S')

table_B['start_time'], table_B['end_time']=(pd.to_datetime(ser, format='%m/%d/%Y %H:%M:%S') \
                                    for ser in [table_B['start_time'], table_B['end_time']])
Parfait
  • 104,375
  • 17
  • 94
  • 125
  • How did it go? I ran your sample data but it returned an Empty data frame as there are no matches as 11:52a is earlier than the after 12p records. – Parfait Oct 26 '16 at 21:15
  • So, the pd.merge works, however just using a small sample of data (about 1/20th of a typical dataset), it creates a dataframe with 40M rows. When I try to execute the subset code, it immediately fills up my 16GB of memory and never completes. table_A.shape Out[118]: (10961, 25) table_B.shape Out[119]: (3666, 7) joined.shape Out[120]: (40183026, 30) – Jeff Oct 29 '16 at 19:09
  • You might need to do this processing in chunks like filter both A and B by specific month and then concatenate all months together. Did you run data.table's `foverlaps` in R on these large datasests? – Parfait Oct 29 '16 at 23:20
  • I didn't have this issue using foverlaps in R, I think because I was never doing a full join and creating a huge table. It looks like in theory this solution should work, given sufficient memory and time, but unfortunately is not a solution for my dataset. – Jeff Nov 01 '16 at 22:33
  • foverlaps is a part of the data.table sub-universe. As such it would be expected that you would be looking for an API to an SQL engine. – IRTFM Dec 07 '16 at 00:38
  • I originally wrote this in SQL and found that R/Python was better suited for the overall project. Regardless, the question specifically stated that I am looking for a Python answer. – Jeff Jan 10 '17 at 18:05
  • @Parfait perhaps better way of doing it https://stackoverflow.com/questions/46525786/how-to-join-two-dataframes-for-which-datetime-column-values-are-within-a-certain – Bharath M Shetty Jan 02 '18 at 10:12