1

I have two dataframes A and B:

For example:

import pandas as pd
import numpy as np
In [37]:
A = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200]})

A[["Start","End"]]
Out[37]:
Start   End
0   10  11
1   11  11
2   20  35
3   62  70
4   198 200
In [38]:
B = pd.DataFrame({'Start': [8, 5, 8, 60], 'End': [10, 90, 13, 75], 'Info': ['some_info0','some_info1','some_info2','some_info3']})

B[["Start","End","Info"]]
Out[38]:
Start   End Info
0   8   10  some_info0
1   5   90  some_info1
2   8   13  some_info2
3   60  75  some_info3

I would like to add column info to dataframe A based on if the interval (Start-End) of A overlaps with the interval of B. In case, the A interval overlaps with more than one B interval, the info corresponding to the shorter interval should be added.

I have been looking arround how to manage this issue and I have found kind of similar questions but most of their answers are using iterrows() which in my case, as I am dealing with huge dataframes is not viable.

I would like something like:

A.merge(B,on="overlapping_interval", how="left")

And then drop duplicates keeping the info coming from the shorter interval.

The output should look like this:

In [39]:
C = pd.DataFrame({'Start': [10, 11, 20, 62, 198], 'End': [11, 11, 35, 70, 200], 'Info': ['some_info0','some_info2','some_info1','some_info3',np.nan]})

C[["Start","End","Info"]]
Out[39]:
Start   End Info
0   10  11  some_info0
1   11  11  some_info2
2   20  35  some_info1
3   62  70  some_info3
4   198 200 NaN

I have found this question really interesting as it suggests the posibility of solving this issue using pandas Interval object. But after lots attempts I have not managed to solve it.

Any ideas?

Pau RS
  • 86
  • 6
  • `iterrows` gives a generator so would be a good fit for huge DataFrames. you already tried it ?? – Elmex80s Jan 09 '18 at 13:01
  • It is not problem of memory but time. – Pau RS Jan 09 '18 at 13:21
  • For now I would write your own algorithm which handles this task. Your first step might be to sort both Dataframes. --- Btw1: to make your question better I would suggest to give an example of the output ---- Btw2: if you print Dataframes in the console then they will be outlined and look better, the ones in your question are badly outlined. – Elmex80s Jan 09 '18 at 13:27
  • @PauRS, can you elaborate on interval edges (left and right) - are they closed/open and please add your desired data set? – MaxU - stand with Ukraine Jan 09 '18 at 13:47
  • Please post the desired output. – user1717828 Jan 09 '18 at 13:59
  • Thank you all! I have just edited the question. Hope that help you to understand my issue. – Pau RS Jan 09 '18 at 14:09
  • @MaxU the intervals are closed. I have managed to add a new column to A and B dataframes with datatype interval (with apply(lambda row: pd.Interval(left=row.Start, right=row.End, closed="both")). But don't know how to do the merge efficiently. – Pau RS Jan 09 '18 at 14:18

1 Answers1

1

I would suggest to do a function then apply on the rows:

First I compute the delta (End - Start) in B for sorting purpose

B['delta'] = B.End - B.Start

Then a function to get information:

def get_info(x):
    #Fully included
    c0 = (x.Start >= B.Start) & (x.End <= B.End)
    #start lower, end include
    c1 = (x.Start <= B.Start) & (x.End >= B.Start)
    #start include, end higher
    c2 = (x.Start <= B.End) & (x.End >= B.End)

    #filter with conditions and sort by delta
    _B = B[c0|c1|c2].sort_values('delta',ascending=True)

    return None if len(_B) == 0 else _B.iloc[0].Info #None if no info corresponding

Then you can apply this function to A:

A['info'] = A.apply(lambda x : get_info(x), axis='columns')


print(A)
   Start  End        info
0     10   11  some_info0
1     11   11  some_info2
2     20   35  some_info1
3     62   70  some_info3
4    198  200        None

Note:

  • Instead of using pd.Interval, make your own conditions. cx are your intervals definitions, change them to get the exact expected behaviour
David Leon
  • 1,017
  • 8
  • 25
  • Thanks a lot! That looks to be a great solution. Just note a typo: instead of len(_dfb) should be len(_B) – Pau RS Jan 09 '18 at 15:20
  • I will now implement it to my code and check if its really as efficient as it seems to be. I will let you know :) – Pau RS Jan 09 '18 at 15:22
  • Sorry for the typo, I had to rename after you put the samples. Let me know if it fits your pb. – David Leon Jan 09 '18 at 15:26
  • As expected, super efficient. It lasted about 5 minutes (A has 120,000 rows and B 400,000 rows). Really appreciate your help!! – Pau RS Jan 09 '18 at 15:46