I have two frames. df1
containing thousands of rows of four columns of string data. The string data can be almost anything but what matters is which columns match. Here I just chose character names for illustration. Within one row of df1
, the four columns when compared, are a code for 1 of 14 fairly complex equations. df2
contains the code and pre-calculated results for the 14 equations. Instead of calculating each of the thousands of values on the fly, I want to compare the four columns in df1
and populate a Result
column using the pre-calculated results from df2
.
df1
looks like this:
Col1 Col2 Col3 Col4
0 'Sam' 'Sam' 'Sam' 'Sam' #<----this is 'AAAA'
1 'Sam' 'Sam' 'Eric' 'Eric' #<----this is 'AABB'
2 'Sam' 'Eric' 'Sam' 'Sam' #<----this is 'ABAA'
.. ....
n 'Sam' 'Eric' 'Piggy' 'Jack' #<----this is 'ABCD'
df2
looks like this:
Code Result
0 'AAAA' 1.0040
1 'ABAB' 7.7777
2 'AABA' 5.5400
3 'ABAA' 0.0400
4 'BAAA' 1.0250
5 'ABAB' 2.0000
6 'ABAC' 6.5300
7 'ABCA' 5.0090
8 'BAAC' 1.0008
9 'BACA' 1.0000
10 'AABB' 56.000
11 'AABC' 0.2500
12 'BCAA' 0.5000
13 'ABCD' 1.7777
Afterwards df1
should look like this:
Col1 Col2 Col3 Col4 Result
0 'Sam' 'Sam' 'Sam' 'Sam' 1.0040
1 'Sam' 'Sam' 'Eric' 'Eric' 56.000
2 'Sam' 'Eric' 'Sam' 'Sam' 0.0400
.. .... .....
n 'Sam' 'Eric' 'Piggy' 'Jack' 1.7777
Because I am unaware of a way to do this all at once, I thought the quickest way to handle this would be to apply a mask to df1
14 times and update the masked data based on an np.array created by a mask from df2. So the first update ('AAAA') would look like this:
#create mask for 'AAAA' in df1
mask1 = (df1['Col1'] == df2['Col2']) & (df1['Col1'] == df2['Col3']) & (df1['Col1'] == df2['Col4'])
#create mask for 'AAAA' in df2
mask2 = df2['Code'] == 'AAAA'
#create a temporary array to update df1
temp_arr = np.tile(df2.loc[mask2, 'Result'].to_numpy(), np.sum(mask1))
#add the array to the column
df1.loc[mask1, 'Result'] = temp_arr
If there's a more efficient way to do this comparison, please let me know. Thanks.
As soon as I can answer, I'll post my solution using @Hwei Geok Ng 's suggestion but unless someone has something more efficient, I'll leave that suggestion as the answer.
Edit: I apologize. I had to edit the output table. I made a mistake in editing. The table originally had letters instead of the data that was supposed to be there. It is correct now.