My question is an extension of this one made a few years ago.
I'm attempting a left join but one of the columns I want to join on needs to be a range value. It needs to be a range because expanding it would mean millions of new (and unnecessary) rows. Intuitively it seems possible using Python's in
operator (as x in range(y, z)
is very common) but would involve a nasty for loop and if/else block. There has to be a better way.
Here's a simple version of my data:
# These are in any order
sample = pd.DataFrame({
'col1': ['1b', '1a', '1a', '1b'],
'col2': ['2b', '2b', '2a', '2a'],
'col3': [42, 3, 21, 7]
})
# The 'look-up' table
look_up = pd.DataFrame({
'col1': ['1a', '1a', '1a', '1a', '1b', '1b', '1b', '1b'],
'col2': ['2a', '2a', '2b', '2b', '2a', '2a', '2b', '2b'],
'col3': [range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101), range(0,10), range(10,101)],
'col4': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
})
I initially tried a merge to see if pandas would understand but there was a type mismatch error.
sample.merge(
look_up,
how='left',
left_on=['col1', 'col2', 'col3'],
right_on=['col1', 'col2', 'col3']
)
# ValueError: You are trying to merge on int64 and object columns. If you wish to proceed you should use pd.concat
Reviewing the documentation for pd.concat
looks like it will not give me result I want either. Rather than appending, I'm still trying to get a result like merge
. I tried to follow the answer given to question I linked at the start but that didn't work either. It's entirely possible I misunderstood how to use np.where
but also I'm hoping there is a solution that is a little less hacky.
Here's my attempt using np.where
:
s1 = sample['col1'].values
s2 = sample['col2'].values
s3 = sample['col3'].values
l1 = look_up['col1'].values
l2 = look_up['col2'].values
l3 = look_up['col3'].values
i, j = np.where((s3[:, None] in l3) & (s2[:, None] == l2) & (s1[:, None] == l1))
result = pd.DataFrame(
np.column_stack([sample.values[i], look_up.values[j]]),
columns=sample.columns.append(look_up.columns)
)
len(result) # returns 0
The result I want should look like this:
col1 col2 col3 col4
'1b' '2b' 42 'h'
'1a' '2b' 3 'c'
'1a' '2a' 21 'b'
'1b' '2a' 7 'e'