1

I have a dataframe containing lower and upper bounds of index, with a corresponding value, and I am trying to match the value of te corresponding index. My dataframe consists of 400.000 rows. an example of my dataframe:

df1 = pd.DataFrame({'low':[4,7,8],'high':[6,7,21],'value':[10,15,20]})
df2 = pd.DataFrame({'index':[4,5,6,7,8,9]})

Output:

  low   high  value
0  4     6     10
1  7     7     15
2  8     21    20

Now I want to add value of df1 to df2, whenever the index is between low and high, resulting in the following:

  index  value
0  4      10 
1  5      10
2  6      10
3  7      15
4  8      20
5  9      20

I tried to make an intervalIndex using the following post: Searching a particular value in a range among two columns python dataframe

v = df1.loc[:, 'low':'high'].apply(tuple, 1).tolist()
idx = pd.IntervalIndex.from_tuples(v, 'both')
df2['value'] = df1.iloc[idx.get_indexer(df2['index'].values), 'value'].values

But my intervals keep overlapping, I removed duplicated versions, but I still need to remove some overlapping intervals. One way to find this was using a for loop with:

[idx.overlaps(x) for x in idx]

But this takes a lot of time and everytime my memory fails. Is there some fast way to find the overlapping interval?

  • If one of the answers below fixes your issue, you should accept it (click the check mark next to the appropriate answer). That does two things. It lets everyone know your issue has been resolved to your satisfaction, and it gives the person that helps you credit for the assist. [See here](http://meta.stackexchange.com/a/5235) for a full explanation. – sushanth Aug 27 '20 at 10:22

3 Answers3

1

Here is a solution assuming index is sorted, create a dict with limit's and use dict against the dataframe then use ffill() to fill the gaps created by map.

between_ = (
    {**df1.set_index('low')['value'].to_dict(),
     **df1.set_index('high')['value'].to_dict()}
)
# {4: 10, 7: 15, 8: 20, 6: 10, 21: 20}

df2['index'].map(between_).ffill()

0    10.0
1    10.0
2    10.0
3    15.0
4    20.0
5    20.0
Name: index, dtype: float64
sushanth
  • 8,275
  • 3
  • 17
  • 28
1

Build a dataframe to join to using pd.concat()

df1 = pd.DataFrame({'low':[4,7,8],'high':[6,7,21],'value':[10,15,20]})
df2 = pd.DataFrame({'index':[4,5,6,7,8,9]}).set_index("index")

df2 = df2.join(pd.concat([pd.DataFrame(index=pd.RangeIndex(r[0],r[1]+1)).assign(value=r[2]) 
                    for r in df1.values])
         )

output

       value
index       
4         10
5         10
6         10
7         15
8         20
9         20
Rob Raymond
  • 29,118
  • 3
  • 14
  • 30
1

It's been a while since this thread has been opened. But I've recently had a similar problem. I found that was solved quite elegantly using Pandas IntervalIndex.

# Create an open IntervalIndex with both ends closed using setting 'both'
interval_idx = pd.IntervalIndex.from_arrays(df1.low, df1.high, 'both')

# Set the interval
df1 = df1.set_index(interval_idx)

# Define a function for getting all matching intervals from a dataframe
def get_interval_value(x, df=None):
    if df is not None:
        try:
            res = df.loc[x].value
        except:
            res = None
    else:
        res = None
    return res

# Compute interval overlaps
df2['value'] = df2['index'].apply(get_interval_value, df=df1)

Which returns the desired outcome of df2:

   index  value
0  4      10   
1  5      10   
2  6      10   
3  7      15   
4  8      20   
5  9      20