I would like to merge data if they are between a given range
As example:
import pandas as pd
import random
data1 = {
'start': [20,38,42,42,67,67,98],
'stop': [20,38,42,42,67,77,98],
'foo': [1,2,3,3,2,2,4],
'bar': [4,3,2,2,2,4,1],
'an id': [1,2,1,2,1,2,1]
}
data1_df = pd.DataFrame.from_dict(data1).groupby(by=['start', 'stop', 'foo', 'bar'], as_index=False).agg({'start': 'first',
'stop': 'first', 'foo': 'first', 'bar': 'first', 'an id': list})
data2 = { 'something': ['na'] * 98,
'pos': [i for i in range(1,99)],
'value': [random.randint(1,30) for _ in range(1,99)],
'is low': [False for i in range(1,99)]}
data2['is low'] = [ True if data2['value'][i] < 10 else False for i in range(0,98)]
data2_df = pd.DataFrame.from_dict(data2)
which could give something like
In [1]: data1_df.head()
Out[1]:
start stop foo bar an id
0 20 20 1 4 [1]
1 38 38 2 3 [2]
2 42 42 3 2 [1, 2]
3 67 67 2 2 [1]
4 67 77 2 4 [2]
In [2]: data2_df.head()
Out[2]:
something pos value is low
0 na 1 27 False
1 na 2 15 False
2 na 3 3 True
3 na 4 18 False
4 na 5 27 False
So the problem is how to get the is_low
column from data2
to data1
such as if one position is low in the range data1[start] : data2[stop]
then is set to low otherwise False
without the problem of range value I would do :
In [3]: pd.merge_ordered(data1_df,data2_df, how='inner', left_on='start', right_on='pos')
Out[3]:
start stop foo bar an id something pos value is low
0 20 20 1 4 [1] na 20 13 False
1 38 38 2 3 [2] na 38 23 False
2 42 42 3 2 [1, 2] na 42 13 False
3 67 67 2 2 [1] na 67 2 True
4 67 77 2 4 [2] na 67 2 True
5 98 98 4 1 [1] na 98 20 False
but that do not work as the fifth row is on range of value (start: 67, stop 77)
which could be expressed like this:
any(data2_df[data2_df['pos'].ge(67) & data2_df['pos'].le(77)]['is low'])
as it is a range I expect (if not aggregated like above with any
)
start stop foo bar an id is low
0 20 20 1 4 [1] [False]
1 38 38 2 3 [2] [False]
2 42 42 3 2 [1, 2] [False]
3 67 67 2 2 [1] [True]
4 67 77 2 4 [2] [True,False,True…]
5 98 98 4 1 [1] [False]
Maybe in such case I have to use a row iterator …
Thanks for your help