6

As input I have two dataframes:

data1 = [{'code':100}, {'code':120}, {'code':110}]
data1 = pd.DataFrame(data1)

   code
 0   100
 1   120
 2   110

data2 = [{'category':1, 'l_bound':99, 'r_bound':105},{'category':2, 'l_bound':107, 'r_bound':110},{'category':3, 'l_bound':117, 'r_bound':135}]
data2 = pd.DataFrame(data2)

   category  l_bound  r_bound
0         1       99      105
1         2      107      110
2         3      117      135

I want to get in the end the following dataframe, with additional column in the first dataframe as a category number if the code lies in the corresponding interval:

    code   category
 0   100          1
 1   120          3
 2   110          2

Intervals are random and the original dataframes are pretty big. Looping with itertuples is too slow. Any pythonic solutions?

Anna Ignashkina
  • 467
  • 4
  • 16
  • pure python - https://eval.in/928533 – splash58 Dec 30 '17 at 18:09
  • 1
    @AntonvBR I haven't looked closely at the example, but the problem (and answers) between the two seem identical... – cs95 Dec 30 '17 at 18:46
  • **Useful observation**: The solutions below might not work if the data are not in a proper format. Make sure that `data1` and `data2` are integers. There also might be mixed formats in some columns of the original datasets (e.g. `'code': '123a'`). To get rid of it just add `data1 = data1[pd.to_numeric(data1['code'], errors='coerce').notnull()] data1['code'] = pd.to_numeric(data1['code'])` – Anna Ignashkina Jan 01 '18 at 15:05

1 Answers1

3

Recreate the dataset:

import pandas as pd

data1 = [{'code':100}, {'code':120}, {'code':113}]
data2 = [{'category':1, 'l_bound':99, 'r_bound':105},
         {'category':2, 'l_bound':107, 'r_bound':110},
         {'category':3, 'l_bound':117, 'r_bound':135}]

data1 = pd.DataFrame(data1)
data2 = pd.DataFrame(data2)

@cᴏʟᴅsᴘᴇᴇᴅ answer (prefered), follow duplicate link:

idx = pd.IntervalIndex.from_arrays(data2['l_bound'], data2['r_bound'], closed='both')
category = data2.loc[idx.get_indexer(data1.code), 'category']

data1['category'] = category.values

Here is a different approach. Create a map with value in range and categories.

# Create a map
d = {i:k for k,v in data2.set_index('category').to_dict('i').items() 
     for i in range(v['l_bound'],v['r_bound']+1)}

# Use map to add new column
data1['category'] = data1.code.map(d)

Finally

print(data1)

Returns:

   code  category
0   100       1.0
1   120       3.0
2   113       NaN

If you want int we can do this:

data1.code.map(d).fillna(-1).astype(int) # -1 meaning no match

And we get:

   code  category
0   100         1
1   120         3
2   113        -1
Anton vBR
  • 18,287
  • 5
  • 40
  • 46