1

I would like to merge data if they are between a given range

As example:

import pandas as pd
import random

data1 = { 
         'start': [20,38,42,42,67,67,98], 
         'stop': [20,38,42,42,67,77,98], 
         'foo': [1,2,3,3,2,2,4], 
         'bar': [4,3,2,2,2,4,1], 
         'an id': [1,2,1,2,1,2,1] 
     } 
data1_df = pd.DataFrame.from_dict(data1).groupby(by=['start', 'stop', 'foo', 'bar'], as_index=False).agg({'start': 'first', 
 'stop': 'first', 'foo': 'first', 'bar': 'first', 'an id': list})                                    
data2 = { 'something': ['na'] * 98, 
                   'pos': [i for i in range(1,99)], 
                   'value': [random.randint(1,30) for _ in range(1,99)], 
                   'is low': [False for i in range(1,99)]}
data2['is low'] = [ True if data2['value'][i] < 10 else False for i in range(0,98)]
data2_df = pd.DataFrame.from_dict(data2)

which could give something like

In [1]: data1_df.head()                                                                                                                          
Out[1]: 
   start  stop  foo  bar   an id
0     20    20    1    4     [1]
1     38    38    2    3     [2]
2     42    42    3    2  [1, 2]
3     67    67    2    2     [1]
4     67    77    2    4     [2]

In [2]: data2_df.head()                                                                                                                          
Out[2]: 
  something  pos  value  is low
0        na    1     27   False
1        na    2     15   False
2        na    3      3    True
3        na    4     18   False
4        na    5     27   False

So the problem is how to get the is_low column from data2 to data1 such as if one position is low in the range data1[start] : data2[stop] then is set to low otherwise False

without the problem of range value I would do :

In [3]: pd.merge_ordered(data1_df,data2_df, how='inner', left_on='start', right_on='pos')                                                                
Out[3]: 
   start  stop  foo  bar   an id something  pos  value  is low
0     20    20    1    4     [1]        na   20     13   False
1     38    38    2    3     [2]        na   38     23   False
2     42    42    3    2  [1, 2]        na   42     13   False
3     67    67    2    2     [1]        na   67      2    True
4     67    77    2    4     [2]        na   67      2    True
5     98    98    4    1     [1]        na   98     20   False

but that do not work as the fifth row is on range of value (start: 67, stop 77)

which could be expressed like this: 

any(data2_df[data2_df['pos'].ge(67) & data2_df['pos'].le(77)]['is low']) 

as it is a range I expect (if not aggregated like above with any)

   start  stop  foo  bar   an id     is low
0     20    20    1    4     [1]      [False]
1     38    38    2    3     [2]      [False]
2     42    42    3    2  [1, 2]      [False]
3     67    67    2    2     [1]      [True]
4     67    77    2    4     [2]      [True,False,True…]
5     98    98    4    1     [1]      [False]

Maybe in such case I have to use a row iterator …

Thanks for your help

bioinfornatics
  • 1,749
  • 3
  • 17
  • 36

0 Answers0