0

I have the following dataframes:

print(df)

client_id   route_id         start                       
    1          1_2        2019-01-01 03:00:00     
    8          1_4        2019-01-23 10:00:00     
    3          1_2        2019-01-15 04:10:00     
    4          1_10       2019-01-10 05:00:00
...

print(accident_df)

   route_id        start         end
      1_2        2018-12-24   2018-01-03
      1_4        2019-01-25   2019-01-29
      1_2        2019-02-24   2019-02-28
      1_10       2019-01-05   2019-01-15

...

I would like to merge them, following these conditions (&):

1."route_id" must match

2.df.start must be in between accident_df.start and accident_df.end (therefore, greater than the first one and smaller than the latter)

print(final_df)

client_id   route_id         start                       
    1          1_2        2019-01-01 03:00:00            
    4          1_10       2019-01-10 05:00:00

Since both dataframes are quite large (>10M lines), and thousands of "route_id"s exist, a solution based on merging on "route_id" and then selecting rows by conditions would be extremely inefficient, causing memory-related problems (as in Python Pandas - Compare 2 dataframes, multiple parameters).

How can I achieve that efficiently?

Alessandro Ceccarelli
  • 1,775
  • 5
  • 21
  • 41

1 Answers1

0

We can say that this approach is more efficient that join since first it will cut df1 to a smaller dataframe and then matches the route_id's. (I'm not an expert)

Suppose your datasets are stored in excel sheets (it can be on any other format/file)

df1 = pd.read_excel('your_file.xlsx',sheet_name='Sheet1')
df2 = pd.read_excel('your_file.xlsx',sheet_name='Sheet2')

With pandasql you can get what you want using SQL script:

import pandas as pd
import pandasql as ps


q1 = """
     SELECT client_id, route_id, start FROM df1 
     WHERE EXISTS 
     (SELECT 1 FROM df2 WHERE df1.start BETWEEN df2.start AND df2.end)
     and route_id IN (SELECT route_id FROM df2)
     """

print(ps.sqldf(q1, locals()))

results in :

       client_id route_id                       start
0          1      1_2  2019-01-01 03:00:00.000000
1          4     1_10  2019-01-10 05:00:00.000000
Ehsan
  • 711
  • 2
  • 7
  • 21