1

I have these two datasets:


val_1 = [81.322, 81.342, 81.452, 81.552, 82.522, 82.562, 82.722, 81.723, 81.322, 81.332 ]
val_2 = [0.234, 0.231, 0.333, 0.324, 0.666, 0.645, 0.334, 0.345, 0.878, 0.888]
data_1 = pd.DataFrame({'val_1': val_1, 'val_2': val_2})
data_1

    val_1   val_2
0   81.322  0.234
1   81.342  0.231
2   81.452  0.333
3   81.552  0.324
4   82.522  0.666
5   82.562  0.645
6   82.722  0.334
7   81.723  0.345
8   81.322  0.878
9   81.332  0.888

and


val_1_start = [81.100,81.41, 81.99, 81.320 ]
val_2_start = [0.230,0.331,0.32, 0.875 ]
val_1_end = [81.400,81.554,82.8, 81.333 ]
val_2_end = [0.281,0.335,0.68,  0.890]
value = [24,21,34,11]
data_2 = pd.DataFrame({'val_1_start': val_1_start, 'val_2_start': val_2_start, 
                       'val_1_end': val_1_end, 'val_2_end':val_2_end, 'value': value})
data_2


   val_1_start  val_2_start  val_1_end  val_2_end   value
0   81.10       0.230        81.400     0.281       24
1   81.41       0.331        81.554     0.335       21
2   81.99       0.320        82.800     0.680       34
3   81.32      0.875         81.333     0.890       11

I am trying to find where val_1 is between val_1_start and val_1_end and val_2 is between val_2_start and val_2_end. Then I would add the value to the first dataset.

I wouldn't mind either output here

val_1   val_2    value
81.322  0.234  24
81.342  0.231  24
81.452  0.333  21
81.552  0.324  NA
82.522  0.666  34
82.562  0.645  34
82.722  0.334  34
81.723  0.345  NA
81.322  0.878  11
81.332  0.888  11

val_1   val_2    value
81.322  0.234  24
81.342  0.231  24
81.452  0.333  21
82.522  0.666  34
82.562  0.645  34
82.722  0.334  34
81.322  0.878  11
81.332  0.888  11

Please let me know how I can do this

user3234242
  • 165
  • 7

2 Answers2

1

Use cross join in DataFrame.merge and then filter by boolean indexing:

df = data_1.merge(data_2, how='cross')

m = (df['val_1'].between(df['val_1_start'], df['val_1_end']) &
     df['val_2'].between(df['val_2_start'], df['val_2_end']))
df = df.loc[m, ['val_1','val_2','value']]
print (df)
     val_1  val_2  value
0   81.322  0.234     24
4   81.342  0.231     24
9   81.452  0.333     21
18  82.522  0.666     34
22  82.562  0.645     34
26  82.722  0.334     34
35  81.322  0.878     11
39  81.332  0.888     11
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • I get the message that `MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False` how would I over come this? – user3234242 Nov 09 '21 at 12:02
  • @user3234242 - Is possible upgrade pandas? – jezrael Nov 09 '21 at 12:02
  • @user3234242 - Or if not possible use [this](https://stackoverflow.com/a/46895905/2901002) solution. – jezrael Nov 09 '21 at 12:04
  • Out of curiosity, how would I do this if it was the other way around? e.g. the value column is in data_1 instead of data_2 – user3234242 Nov 09 '21 at 14:08
  • @user3234242 - My opinion - I think it should be possible, only in real data should multiple vlues matched, not only one. Solution is same like above. – jezrael Nov 09 '21 at 14:11
1

One option is the conditional_join from pyjanitor; it tries to avoid a cartesian join for non-equi joins ( to reduce memory consumption):

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor

(data_1.conditional_join(
          data_2,
          # column from left, column from right, join type
          ('val_1', 'val_1_start', '>='),
          ('val_1', 'val_1_end', '<='),
          ('val_2', 'val_2_start', '>='),
          ('val_2', 'val_2_end', '<=')
          )
      .loc(axis=1)['val_1', 'val_2', 'value']
)

    val_1  val_2  value
0  81.322  0.234     24
1  81.342  0.231     24
2  81.452  0.333     21
3  82.522  0.666     34
4  82.562  0.645     34
5  82.722  0.334     34
6  81.322  0.878     11
7  81.332  0.888     11

for a left join, pass left to the how parameter:

(data_1.conditional_join(
          data_2,
          ('val_1', 'val_1_start', '>='),
          ('val_1', 'val_1_end', '<='),
          ('val_2', 'val_2_start', '>='),
          ('val_2', 'val_2_end', '<='),
          how = 'left'
          )
      .loc(axis=1)['val_1', 'val_2', 'value']
)

    val_1  val_2  value
0  81.322  0.234   24.0
1  81.342  0.231   24.0
2  81.452  0.333   21.0
3  81.552  0.324    NaN
4  82.522  0.666   34.0
5  82.562  0.645   34.0
6  82.722  0.334   34.0
7  81.723  0.345    NaN
8  81.322  0.878   11.0
9  81.332  0.888   11.0

Note that at the moment, you have to install the pyjanitor dev version to use this function. pip install git+https://github.com/pyjanitor-devs/pyjanitor.git

This page also has options that you may find useful/relevant.

sammywemmy
  • 27,093
  • 4
  • 17
  • 31