
My question is an extension of this one made a few years ago.

I'm attempting a left join, but one of the columns I want to join on needs to be a range value. It needs to be a range because expanding it would mean millions of new (and unnecessary) rows. Intuitively it seems possible using Python's `in` operator (since `x in range(y, z)` is a common idiom), but that would involve a nasty for loop and if/else block. There has to be a better way.

Here's a simple version of my data:

import pandas as pd

# These are in any order
sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})

# The 'look-up' table
look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101)],
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

I initially tried a plain merge to see if pandas would understand the ranges, but it raised a type-mismatch error.

sample.merge(
    look_up,
    how='left',
    left_on=['col1', 'col2', 'col3'],
    right_on=['col1', 'col2', 'col3']
)
# ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat

Reviewing the documentation for pd.concat, it looks like it will not give me the result I want either. Rather than appending, I'm still trying to get a merge-like result. I tried to follow the answer given to the question I linked at the start, but that didn't work either. It's entirely possible I misunderstood how to use np.where, but I'm also hoping there is a solution that is a little less hacky.

Here's my attempt using np.where:

import numpy as np

s1 = sample['col1'].values
s2 = sample['col2'].values
s3 = sample['col3'].values

l1 = look_up['col1'].values
l2 = look_up['col2'].values
l3 = look_up['col3'].values

i, j = np.where((s3[:, None] in l3) & (s2[:, None] == l2) & (s1[:, None] == l1))
result = pd.DataFrame(
    np.column_stack([sample.values[i], look_up.values[j]]), 
    columns=sample.columns.append(look_up.columns)
)

len(result)  # returns 0
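(A side note on why the mask comes out empty: `in` against a NumPy array does not broadcast — `s3[:, None] in l3` is evaluated as `(l3 == s3[:, None]).any()`, a single scalar boolean, here `False`, which zeroes out the whole condition. A minimal sketch of an explicit elementwise membership test, with `ranges`/`vals` as illustrative names:)

```python
import numpy as np

# Ranges kept as plain Python objects, as in the look-up table
ranges = [range(0, 10), range(10, 101)]
vals = np.array([3, 42])

# `vals[:, None] in ranges` would not give a mask; the elementwise
# membership test has to be spelled out, e.g.:
mask = np.array([[v in r for r in ranges] for v in vals])
print(mask)
# [[ True False]
#  [False  True]]
```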

The result I want should look like this:

col1  col2 col3 col4
'1b'  '2b'   42  'h'
'1a'  '2b'    3  'c'
'1a'  '2a'   21  'b'
'1b'  '2a'    7  'e'
  • I do not know how big your frame is or the largest range (so they may not be the best option) but you can explode and merge: `sample.merge(look_up.explode('col3'), on=['col1', 'col2', 'col3'], how='left')` – It_is_Chris Oct 27 '20 at 20:12
  • I should have added the length currently is about 30 thousand rows. The largest range is 1,000 to 100,000 and the most common ones are probably 150 to 10,000. I think using `df.explode` might be the simplest, albeit brute-force option. Thanks! – dlindsay Oct 27 '20 at 20:17
  • @dlindsay `df.explode` is a fast, although very memory-heavy, operation, but unless you encounter memory issues, it could be the simplest solution indeed. – Marcus Oct 27 '20 at 20:19
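The explode-and-merge suggestion from the comments can be sketched as below. One caveat (an assumption worth checking against your pandas version): `explode` leaves `col3` with object dtype, so a cast back to `int64` is needed before the merge, otherwise it hits the same dtype-mismatch error as above.

```python
import pandas as pd

sample = pd.DataFrame({
    'col1': ['1b', '1a', '1a', '1b'],
    'col2': ['2b', '2b', '2a', '2a'],
    'col3': [42, 3, 21, 7]
})

look_up = pd.DataFrame({
    'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
    'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
    'col3': [range(0, 10), range(10, 101)] * 4,
    'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})

# One row per integer in each range, then an ordinary equi-join
exploded = look_up.explode('col3').astype({'col3': 'int64'})
result = sample.merge(exploded, on=['col1', 'col2', 'col3'], how='left')
print(result['col4'].tolist())
# ['h', 'c', 'b', 'e']
```

This is the brute-force option discussed above: fine at 30 thousand rows, but memory grows with the total width of the ranges.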

1 Answer


Since it looks like the ranges are pretty big and you are working with integer values, you can just compute the min and max of each range:

# snapshot of the original columns, taken before adding the helpers
columns = look_up.columns

# min/max of a range object are its endpoints
look_up['minval'] = look_up['col3'].apply(min)
look_up['maxval'] = look_up['col3'].apply(max)

(sample.merge(look_up, on=['col1', 'col2'], how='left',
              suffixes=['', '_'])
       .query('minval <= col3 <= maxval')
       [columns]
)

Output:

  col1 col2  col3 col4
1   1b   2b    42    h
2   1a   2b     3    c
5   1a   2a    21    b
6   1b   2a     7    e