
I have a cross-join-like operation that I have implemented using a for loop, and I need to make it fast and preferably elegant. For every day in a date range, it creates one entry per block whose start/end dates contain that day.

This works fine for small datasets but slows to a crawl on larger ones. I know that it can be vectorized; my implementation is just very bad.

I have looked at the other posts on how to vectorize loops over DataFrames. I read 10 Minutes to pandas, as suggested in the post How to iterate over rows in a DataFrame in Pandas, tried lambda functions, and messed with Cython. I just can't get it.

I tried pandas.MultiIndex.to_frame, and I have a strong feeling that it, or one of its cousins, is a good way to go (a loop-free sketch of this idea appears after the output below). I have also tried a bunch of other things, and nothing has worked.

I want to learn to code elegantly. All suggestions, variations on the solution, and comments are welcome.

from datetime import datetime
import pandas as pd 

beginning = pd.to_datetime('14/09/2021', dayfirst=True)
today = pd.to_datetime(datetime.today())
date_range = pd.date_range(start=beginning, end=today)  # .tolist()

frame = pd.DataFrame(columns=['Record_Date', 'Identifier', 'start_date', 'end_date', 'color'])
block = pd.DataFrame(
    {'Identifier': ['4913151F', 'F4E9124A', '31715888', 'D0C57FCA', '57B4D7EB', 'E46F1E5D', '99E0A2F8', 'D77E342E',
                    'C596D233', 'D0EED63F', 'D0C57FCA'],
     'start_date': ['03/11/2020', '05/07/2022', '22/12/2016', '17/03/2024', '14/10/2022', '08/08/2022', '04/11/2020',
                    '13/03/2023', '05/11/2021', '12/27/2022', '13/06/2022'],
     'end_date': ['11/07/2023', '11/04/2023', '14/12/2018', '20/01/2025', '15/06/2023', '09/01/2023', '16/07/2022',
                  '19/05/2024', '24/09/2022', '17/11/2023', '13/06/2023'],
     'color': ['red', 'green', 'magenta', 'yellow', 'light_blue', 'dark_blue', 'black', 'white', 'pink', 'orange',
               'yellow']})

block.start_date = pd.to_datetime(block.start_date, dayfirst=True, format='mixed')
block.end_date = pd.to_datetime(block.end_date, dayfirst=True, format='mixed')
block_uniques = block.drop_duplicates(['Identifier', 'start_date'])

# slow part: one boolean filter plus a pd.concat per day in the range
for x in date_range:
    temp_df = block_uniques[(block_uniques.start_date <= x) & (block_uniques.end_date >= x)].copy()
    temp_df.insert(0, 'Record_Date', x)
    frame = pd.concat([frame, temp_df])
frame = frame.sort_values(['Record_Date', 'Identifier'])
frame = frame.reset_index(drop=True)

print(frame)

Output (which any solution should reproduce):

     Record_Date Identifier start_date   end_date   color
0     2021-09-14   4913151F 2020-11-03 2023-07-11     red
1     2021-09-14   99E0A2F8 2020-11-04 2022-07-16   black
2     2021-09-15   4913151F 2020-11-03 2023-07-11     red
3     2021-09-15   99E0A2F8 2020-11-04 2022-07-16   black
4     2021-09-16   4913151F 2020-11-03 2023-07-11     red
...          ...        ...        ...        ...     ...
2641  2023-07-05   D0EED63F 2022-12-27 2023-11-17  orange
2642  2023-07-05   D77E342E 2023-03-13 2024-05-19   white
2643  2023-07-06   4913151F 2020-11-03 2023-07-11     red
2644  2023-07-06   D0EED63F 2022-12-27 2023-11-17  orange
2645  2023-07-06   D77E342E 2023-03-13 2024-05-19   white

[2646 rows x 5 columns]
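
For reference, here is the loop-free sketch mentioned above. It is only one possible vectorized variant, not a definitive solution: it assumes pandas >= 1.2 (for how='cross') and reuses block_uniques and date_range from the code above. It materialises the full date x block grid before filtering, so it trades memory for speed:

# sketch: vectorized cross join + filter; memory grows with
# len(date_range) * len(block_uniques) before the filter is applied
dates = pd.DataFrame({'Record_Date': date_range})
out = (block_uniques
       .merge(dates, how='cross')                       # full date x block grid
       .query('start_date <= Record_Date <= end_date')  # keep only in-range rows
       .sort_values(['Record_Date', 'Identifier'])
       .reset_index(drop=True))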

Gorgonzola
  • Have you tried iterrows? That could work. – Quant1892387 Jul 06 '23 at 22:03
  • @Quant1892387 That is very slow, that will not work. That's exactly what I'm avoiding. – Gorgonzola Jul 06 '23 at 22:04
  • If your for loop has many iterations, `pd.concat` inside it is a bad idea. It's better to gather all the `temp_df`s into a list and then concat them all at once (see the sketch just after these comments). – dankal444 Jul 06 '23 at 22:22
  • @dankal444 Store it in a dictionary or something else, I suppose. And I agree, I'll do that. But that's still not the solution I'm looking for. – Gorgonzola Jul 06 '23 at 22:24
  • Have you `line_profile`d your code? That is the first thing to do: locate the bottleneck instead of optimizing everything. Second, if the problem appears with larger datasets, it would be better to have an example that resembles them. – dankal444 Jul 06 '23 at 22:28
  • @dankal444 I'll try that. There is an underlying need to vectorize this and many other problems, so that's why I'm most interested in those solutions. For the dataset, we could do df_new = pd.DataFrame(np.repeat(frame.values, 10000, axis=0)); that will give you a good example of the runtime. I think that's easiest, instead of dropping a massive dataset here. – Gorgonzola Jul 06 '23 at 22:38
  • Wasn't suggesting dropping the whole dataset here. Sometimes a simple repeat or randomization of the dataframe (which we could do) gives a dataset with different characteristics than the real one, and optimizations for one will not work for the other. That's why I was asking you to do it: you know the "nature" of the real dataset and which technique for creating a random dataset is fine. – dankal444 Jul 06 '23 at 23:30
  • Looks like some form of inequality join – sammywemmy Jul 06 '23 at 23:31
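
To make dankal444's comment concrete: `pd.concat` inside the loop re-copies everything accumulated so far on every iteration, which is quadratic. A minimal sketch of the list-then-concat pattern, reusing the names from the question (still a loop, just without the repeated copying):

# collect the per-day slices in a list, then concatenate once at the end
pieces = []
for x in date_range:
    hits = block_uniques[(block_uniques.start_date <= x) & (block_uniques.end_date >= x)].copy()
    hits.insert(0, 'Record_Date', x)
    pieces.append(hits)
frame = pd.concat(pieces, ignore_index=True)
frame = frame.sort_values(['Record_Date', 'Identifier']).reset_index(drop=True)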

1 Answer


This looks like some form of inequality join; conditional_join offers an efficient way to handle it. Note that if your dates in block are not overlapping, then pd.IntervalIndex is suitable and performant (a sketch of that alternative follows after the output below).

Kindly install the dev version, which has an optimised version of the function:

# pip install pyjanitor
# install the dev for an optimised version
# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import janitor
import pandas as pd

# convert date_range to either a named series, or a dataframe
date_range = pd.Series(date_range, name = 'date')

(block
.conditional_join(
    date_range, 
    # column from the left,
    # column from the right,
    # operator
    ('start_date', 'date', '<='), 
    ('end_date', 'date', '>='),
    # in some scenarios,
    # numba might offer a perf boost
    use_numba=False,
   )
)
     Identifier start_date   end_date   color       date
0      4913151F 2020-11-03 2023-07-11     red 2021-09-14
1      4913151F 2020-11-03 2023-07-11     red 2021-09-15
2      4913151F 2020-11-03 2023-07-11     red 2021-09-16
3      4913151F 2020-11-03 2023-07-11     red 2021-09-17
4      4913151F 2020-11-03 2023-07-11     red 2021-09-18
...         ...        ...        ...     ...        ...
2644   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-09
2645   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-10
2646   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-11
2647   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-12
2648   D0C57FCA 2022-06-13 2023-06-13  yellow 2023-06-13

[2649 rows x 5 columns]
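
For the non-overlapping case mentioned at the top of this answer, here is a possible sketch of the pd.IntervalIndex lookup. Note the caveat: IntervalIndex.get_indexer raises on overlapping intervals, and the sample block data does overlap, so this only applies when each date can fall into at most one range:

# sketch: only valid when the (start_date, end_date) intervals do not overlap
idx = pd.IntervalIndex.from_arrays(block.start_date, block.end_date, closed='both')
pos = idx.get_indexer(date_range)   # -1 where no interval contains the date
mask = pos != -1
out = (block.iloc[pos[mask]]
       .assign(date=date_range[mask].values)  # .values avoids index alignment
       .reset_index(drop=True))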
sammywemmy
  • This definitely answers the question. It runs almost instantaneously. I'm impressed, thank you. I would love to see other implementations but this is amazing. Many thanks! 10/10 – Gorgonzola Jul 07 '23 at 02:01
  • There is some very strange behavior going on. After using janitor, both implementations run super fast. I'm so confused. I'm also getting this warning: RuntimeWarning: subpackages can technically be lazily loaded, but it causes the package to be eagerly loaded even if it is already lazily loaded.So, you probably shouldn't use subpackages with this lazy feature. warnings.warn(msg, RuntimeWarning) – Gorgonzola Jul 07 '23 at 02:26
  • the runtimewarning is coming from where? pandas? pyjanitor? Are you running this in a notebook? – sammywemmy Jul 07 '23 at 02:28
  • I have no clue. I'm guessing the lazy loading comes from pyjanitor, and I didn't have it before. All the warning says: \venv\Lib\site-packages\lazy_loader\__init__.py:185: RuntimeWarning: subpackages can technically be lazily loaded, but it causes the package to be eagerly loaded even if it is already lazily loaded.So, you probably shouldn't use subpackages with this lazy feature. warnings.warn(msg, RuntimeWarning) I am running it in PyCharm – Gorgonzola Jul 07 '23 at 02:30
  • Ahhh yes, pyjanitor uses lazy loading to help with module import speed. Do you mind raising an issue on the pyjanitor GitHub page, so someone with more knowledge of module import loading can have a look? It doesn't pop up for me when I use it in a Jupyter notebook, hence the question. Maybe it's a PyCharm thing, I don't know. – sammywemmy Jul 07 '23 at 02:32
  • I would love to. I will. First I will run it outside of PyCharm to see how that goes. – Gorgonzola Jul 07 '23 at 02:35
  • Ok, I ran it in the terminal and the same warning happened. I will go to the pyjanitor GitHub page and report it. But do you know why both implementations now run super fast? Why is the first, loop-based one I made running fast now? – Gorgonzola Jul 07 '23 at 02:43
  • Kindly run a timeit on both codes and share the results. I assume you installed the dev version of pyjanitor? – sammywemmy Jul 07 '23 at 02:44
  • I ran the code in a new virtual environment. Changed to python 3.9. The results were as expected. I couldn't replicate my conclusion. Disregard my comment. On the original dataset, the results were 35 seconds vs 0 seconds. – Gorgonzola Jul 10 '23 at 15:23
  • Also, can you help me find a better title for the question? I feel it is perhaps not optimal. I think it's a useful question that perhaps doesn't have the best title. – Gorgonzola Jul 10 '23 at 17:55
  • I think it's a good title; probably just remove the caveats -> `how to optimise an inequality join in pandas`. Also, this [page](https://stackoverflow.com/q/23508351/7175713) has lots of options for inequality joins. – sammywemmy Jul 10 '23 at 23:07
  • Pyjanitor has released a new version, so you can install that without having the lazy loading warning. – sammywemmy Jul 28 '23 at 12:51
  • Yes! I already did. Also: only this silenced the warning: ``` with warnings.catch_warnings(): warnings.simplefilter("ignore") import janitor ``` – Gorgonzola Jul 28 '23 at 18:29