
This question is a "sequel" to the one I asked here. I have 2 DataFrames that look like this:

df1 =
               Timestamp    CP_ID  Con  Value
0   2018-01-02 07:30:00.000 3107    1   6275.20
1   2018-01-02 07:45:00.000 3107    1   6275.95
2   2018-01-02 08:00:00.000 3107    1   6276.70
3   2018-01-02 08:15:00.000 3107    1   6277.45
4   2018-01-02 08:30:00.000 3107    1   6278.19

df2 =
          dStart                dEnd        CP_ID   Con A_Id
0   2018-03-12 18:58:53 2018-03-12 21:19:33 30554   1   03FC4BA8
1   2018-04-16 17:53:25 2018-04-16 20:22:52 30554   1   03FC4BA8
2   2018-05-10 19:19:47 2018-05-10 21:28:04 30554   1   03FC4BA8
3   2018-05-26 18:03:19 2018-05-26 20:49:23 30553   1   03FC4BA8
4   2018-08-10 18:26:29 2018-08-10 20:57:15 30827   1   03FC4BA8

df2 is a list of all sessions from all users (dStart/dEnd of each session).

df1 is user-specific session info (Timestamps throughout each session).

CP_ID and Con specify a unique location.

A_Id specifies a unique user.

I want to merge these two tables so that I'm left with all of the info for a given user's sessions. My current solution, as proposed by Wen-Ben in my previous post, is something like:

# pair every reading with every session at the same location (CP_ID, Con)
df = pd.merge(df2, df1, how='left', on=['CP_ID', 'Con'])

# keep only the readings that fall inside the session window
df = df.loc[(df.Timestamp > df.dStart) & (df.Timestamp < df.dEnd)]

This accomplishes the task, but it takes a relatively long time. I need to do this operation for thousands of users, and these two steps (.merge and .loc) account for a large share of my script's runtime.
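
In case it helps to gauge the scale, here is a toy setup with made-up data that reproduces the same merge-then-filter pattern (sizes, ID ranges and values are illustrative, not my real data):

# toy data, roughly the shape of the real frames (made up, for timing only)
import numpy as np
import pandas as pd

np.random.seed(0)
n_readings, n_sessions = 500000, 5000

df1 = pd.DataFrame({
    'Timestamp': pd.Timestamp('2018-01-01')
                 + pd.to_timedelta(np.random.randint(0, 365 * 96, n_readings) * 15, unit='m'),
    'CP_ID': np.random.randint(30000, 31000, n_readings),
    'Con': 1,
    'Value': np.random.rand(n_readings) * 1e4,
})

df2 = pd.DataFrame({
    'dStart': pd.Timestamp('2018-01-01')
              + pd.to_timedelta(np.random.randint(0, 365 * 24, n_sessions), unit='h'),
    'CP_ID': np.random.randint(30000, 31000, n_sessions),
    'Con': 1,
    'A_Id': '03FC4BA8',
})
df2['dEnd'] = df2['dStart'] + pd.Timedelta(hours=2)

# the two steps that dominate the runtime
df = pd.merge(df2, df1, how='left', on=['CP_ID', 'Con'])
df = df.loc[(df.Timestamp > df.dStart) & (df.Timestamp < df.dEnd)]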

Is there a more efficient way of doing this?

  • yes, put it in a database and join it there, it will be much faster – gold_cy Feb 21 '19 at 12:19
  • @aws_apprentice there are several reasons why I need/prefer to do this in python. Any ideas on how to improve in pandas? – thepez87 Feb 21 '19 at 12:52
  • you can put it in a db and still do it in python? python ships with sqlite3 in the stdlib which can probably perform this a lot faster if set up correctly, honestly that’s your best bet – gold_cy Feb 21 '19 at 13:23
  • @aws_apprentice I am not convinced that putting the data into a database will make things faster, see http://wesmckinney.com/blog/high-performance-database-joins-with-pandas-dataframe-more-benchmarks/ (But maybe the benchmark is out of date) – JoergVanAken Feb 21 '19 at 15:15
  • those benchmarks are 7 years old – gold_cy Feb 21 '19 at 16:01
  • https://blog.thedataincubator.com/2018/05/sqlite-vs-pandas-performance-benchmarks/ it seems that `pandas` is 2-5x faster for joins than sqlite; however, sqlite is 10-50x faster for filtering, which is what's needed here. I would try to do it in sqlite and see what happens – gold_cy Feb 21 '19 at 16:07
  • @aws_apprentice could you please write your comment as an answer with an example of how that might look? I have no experience with sqlite or querying it from Python; I've only made calls to SQL Server using pyodbc. – thepez87 Feb 22 '19 at 08:16
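
For reference, a minimal sketch of the in-memory sqlite3 route suggested in the comments above might look something like this (table names and indexes are assumptions, the timestamps are assumed to compare correctly as ISO-formatted text, and it is untested for speed):

# push both frames into an in-memory SQLite database and let SQL do the
# join + range filter; assumes df1/df2 are loaded as above
import sqlite3
import pandas as pd

con = sqlite3.connect(':memory:')          # nothing is written to disk
df1.to_sql('readings', con, index=False)   # per-timestamp user data
df2.to_sql('sessions', con, index=False)   # session start/end data

# indexes on the join/filter columns are what should make the lookup fast
con.execute('CREATE INDEX ix_readings ON readings (CP_ID, Con, Timestamp)')
con.execute('CREATE INDEX ix_sessions ON sessions (CP_ID, Con, dStart, dEnd)')

# datetime columns are stored as ISO text, so string comparison preserves ordering
query = '''
    SELECT s.*, r.Timestamp, r.Value
    FROM sessions s
    JOIN readings r
      ON r.CP_ID = s.CP_ID
     AND r.Con = s.Con
     AND r.Timestamp > s.dStart
     AND r.Timestamp < s.dEnd
'''
df = pd.read_sql_query(query, con)
con.close()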

0 Answers