Fastest way to merge pandas dataframe on ranges

Question

I have a dataframe A

    ip_address
0   13
1   5
2   20
3   11
.. ........

and another dataframe B

    lowerbound_ip_address   upperbound_ip_address           country
0    0                       10                             Australia
1    11                      20                             China

based on this I need to add a column in A such that

ip_address  country
13          China
5           Australia

I have an idea that I should write define a function and then call map on each row of A. But how would I search through each row of B for this. Is there a better way to do this.

are the lower and upper bounds in fixed intervals of 10? – ShreyasG Sep 12 '17 at 14:45 — ShreyasG, Sep 12 '17 at 14:45
@ShreyasG No they are not fixed – John Constantine Sep 12 '17 at 17:21 — John Constantine, Sep 12 '17 at 17:21

Zero · Answer 1 · 2017-09-12T14:56:16.497

Use pd.IntervalIndex

In [2503]: s = pd.IntervalIndex.from_arrays(dfb.lowerbound_ip_address,
                                            dfb.upperbound_ip_address, 'both')

In [2504]: dfa.assign(country=dfb.set_index(s).loc[dfa.ip_address].country.values)
Out[2504]:
   ip_address    country
0          13      China
1           5  Australia
2          20      China
3          11      China

Details

In [2505]: s
Out[2505]:
IntervalIndex([[0, 10], [11, 20]]
              closed='both',
              dtype='interval[int64]')

In [2507]: dfb.set_index(s)
Out[2507]:
          lowerbound_ip_address  upperbound_ip_address    country
[0, 10]                       0                     10  Australia
[11, 20]                     11                     20      China

In [2506]: dfb.set_index(s).loc[dfa.ip_address]
Out[2506]:
          lowerbound_ip_address  upperbound_ip_address    country
[11, 20]                     11                     20      China
[0, 10]                       0                     10  Australia
[11, 20]                     11                     20      China
[11, 20]                     11                     20      China

Setup

In [2508]: dfa
Out[2508]:
   ip_address
0          13
1           5
2          20
3          11

In [2509]: dfb
Out[2509]:
   lowerbound_ip_address  upperbound_ip_address    country
0                      0                     10  Australia
1                     11                     20      China

nice use of IntervalIndex; want to add a small example like this to the docs ? — Jeff, Sep 13 '17 at 01:23
How would you handle a loc KeyError that could happen if `dfa.ip_address` contains an item that is not covered by any index bound of dfb? — xicocaio, Aug 28 '20 at 18:31

BENY · Answer 2 · 2017-09-12T18:48:38.023

5

Try pd.merge_asof

df['lowerbound_ip_address']=df['ip_address']
pd.merge_asof(df1,df,on='lowerbound_ip_address',direction ='forward',allow_exact_matches =False)
Out[811]: 
   lowerbound_ip_address  upperbound_ip_address    country  ip_address
0                      0                     10  Australia           5
1                     11                     20      China          13

edited Sep 12 '17 at 18:48

answered Sep 12 '17 at 14:52

BENY

317,841
20
164
234

This is very quick when the input dataframe is quite large. There is a caveat that the index column needs to be sorted. – Alex Nov 08 '22 at 12:49

score 4 · Answer 3 · answered Sep 12 '17 at 15:07

IntervalIndex is as of pandas 0.20.0 and the solution by @JohnGalt using it is excellent.

Prior to that version, this solution would work which expands the ip addresses by country for the complete range.

df_ip = pd.concat([pd.DataFrame(
    {'ip_address': range(row['lowerbound_ip_address'], row['upperbound_ip_address'] + 1), 
     'country': row['country']}) 
    for _, row in dfb.iterrows()]).set_index('ip_address')
>>> dfa.set_index('ip_address').join(df_ip)
              country
ip_address           
13              China
5           Australia
20              China
11              China

Fastest way to merge pandas dataframe on ranges

3 Answers3

Linked