1

I have two pandas dataframes df1 and df2 of the form:

df1

start   end   text   source
1       5     abc     1  
8       10    def     1
15      20    ghi     1
25      30    xxx     1
42      45    zzz     1 

df2

start   end   text   source
1       6     jkl     2  
7       9     mno     2
11      13    pqr     2
16      17    stu     2
18      19    vwx     2
32      37    yyy     2
40      47    rrr     2

I want to return the intersections of the two dataframes based on the start and end columns in following format:

out_df

start_1   end_1   start_2   end_2  text_1   text_2
1         5       1         6      abc      jkl        
8         10      7         9      def      mno
15        20      16        17     ghi      stu 
15        20      18        19     ghi      vwx
42        45      40        47     zzz      rrr

What is the best method to achieve this?

Melsauce
  • 2,535
  • 2
  • 19
  • 39
  • I would create lists, explode them, merge, then drop duplicates, if you have the memory to do it. – ifly6 Aug 04 '22 at 13:24
  • 1
    Have you tried the solution [here](https://stackoverflow.com/questions/44367672/best-way-to-join-merge-by-range-in-pandas)? – RVA92 Aug 04 '22 at 13:30
  • @RVA92 I think that is similar, but still too different from my use case. – Melsauce Aug 04 '22 at 13:48

1 Answers1

3

One option is with conditional_join from pyjanitor:

# pip install pyjanitor
import pandas as pd
import janitor

df1.conditional_join(
      df2, 
      ('start', 'end', '<='), 
      ('end', 'start', '>='))

   left                 right
  start end text source start end text source
0     1   5  abc      1     1   6  jkl      2
1     8  10  def      1     7   9  mno      2
2    15  20  ghi      1    16  17  stu      2
3    15  20  ghi      1    18  19  vwx      2
4    42  45  zzz      1    40  47  rrr      2

In the dev version, you can rename the columns, and avoid the MultiIndex (the MultiIndex occurs because the column names are not unique):

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

df1.conditional_join(
        df2, 
        ('start', 'end', '<='), 
        ('end', 'start', '>='), 
        df_columns = {'start':'start_1', 
                      'end':'end_1', 
                      'text':'text_1'},
        right_columns = {'start':'start_2', 
                         'end':'end_2', 
                         'text':'text_2'})

   start_1  end_1 text_1  start_2  end_2 text_2
0        1      5    abc        1      6    jkl
1        8     10    def        7      9    mno
2       15     20    ghi       16     17    stu
3       15     20    ghi       18     19    vwx
4       42     45    zzz       40     47    rrr

The idea for overlaps is the start of interval one should be less than the end of interval 2, while the end of interval two should be less than the start of interval one, that way overlap is assured. I pulled that idea from pd.Interval.overlaps here

Another option is with the piso library; the answer here might point you in the right direction

sammywemmy
  • 27,093
  • 4
  • 17
  • 31
  • Thanks. Is there an alternative method in vanilla/pandas? – Melsauce Aug 04 '22 at 14:50
  • I'm running into errors (updating pandas does not seem to help): pyjanitor ImportError: cannot import name 'apply_if_callable' from 'pandas.core.common' – Melsauce Aug 04 '22 at 16:50
  • Still having the errors? Is the error when you run the code? At what point does the error occur? – sammywemmy Aug 04 '22 at 18:18
  • it turns out its in the import statement itself. import pyjanitor – Melsauce Aug 04 '22 at 19:05
  • . Use import janitor – sammywemmy Aug 04 '22 at 20:13
  • My mistake, that's what I meant and used: import janitor – Melsauce Aug 05 '22 at 13:06
  • odd, cos I am not getting that error. what version of pandas are you on? – sammywemmy Aug 05 '22 at 13:16
  • Hi @sammywemmy, I got a `TypeError: conditional_join() got an unexpected keyword argument 'right_columns'` after running the code. That's weird because in the repo, we can see that keyword https://github.com/pyjanitor-devs/pyjanitor/blob/dev/janitor/functions/conditional_join.py. Do you know how to fix this, please ? And what's about the dev version ? – Timeless Sep 20 '22 at 12:06
  • 1
    @abokey, column selection is in the dev version. Packages usually have a Dev version and a release version. The Dev version has the latest but has not been released yet. You have to install it iwth this :`pip install git+https://github.com/pyjanitor-devs/pyjanitor.git` uninstall the current version before installing. The Dev version also supports numba for more peerformance – sammywemmy Sep 20 '22 at 12:35
  • Thank you. I did uninstall the current version I had and then run your command who successfully installed pyjanitor-0.23.1. However, jupyter still throws the same error. It looks like I still need to run some additionnal commands in Jupyter, as per the documentation https://pyjanitor-devs.github.io/pyjanitor/devguide/. – Timeless Sep 20 '22 at 12:41
  • Hmmmm ... Odd ... Lemme know how it goes. If you can, create a new, clean environment, and then install the Dev version. If all fails, you can do the column renaming before the conditional join – sammywemmy Sep 20 '22 at 12:43