
I have two DataFrames. df1 contains thousands of rows with four columns of string data. The strings can be almost anything; what matters is which columns match. Here I just chose character names for illustration. Within one row of df1, the pattern of matches among the four columns corresponds to a code for 1 of 14 fairly complex equations. df2 contains the codes and pre-calculated results for those 14 equations. Instead of calculating each of the thousands of values on the fly, I want to compare the four columns in df1 and populate a Result column using the pre-calculated results from df2.

df1 looks like this:

     Col1    Col2    Col3    Col4
0    'Sam'   'Sam'   'Sam'   'Sam'  #<----this is 'AAAA'
1    'Sam'   'Sam'   'Eric'  'Eric' #<----this is 'AABB' 
2    'Sam'   'Eric'  'Sam'   'Sam'  #<----this is 'ABAA'
..   ....                          
n    'Sam'   'Eric'  'Piggy' 'Jack' #<----this is 'ABCD'

df2 looks like this:

      Code    Result
0    'AAAA'   1.0040
1    'ABAB'   7.7777
2    'AABA'   5.5400
3    'ABAA'   0.0400
4    'BAAA'   1.0250
5    'ABAB'   2.0000
6    'ABAC'   6.5300
7    'ABCA'   5.0090
8    'BAAC'   1.0008
9    'BACA'   1.0000
10   'AABB'   56.000
11   'AABC'   0.2500
12   'BCAA'   0.5000
13   'ABCD'   1.7777

Afterwards df1 should look like this:

     Col1    Col2    Col3    Col4    Result
0    'Sam'   'Sam'   'Sam'   'Sam'   1.0040
1    'Sam'   'Sam'   'Eric'  'Eric'  56.000
2    'Sam'   'Eric'  'Sam'   'Sam'   0.0400
..   ....                            .....
n    'Sam'   'Eric'  'Piggy' 'Jack'  1.7777

Because I am unaware of a way to do this all at once, I thought the quickest way to handle this would be to apply a mask to df1 14 times and update the masked data based on an np.array created by a mask from df2. So the first update ('AAAA') would look like this:

import numpy as np

#create mask for 'AAAA' in df1 (all four columns in the row are equal)
mask1 = (df1['Col1'] == df1['Col2']) & (df1['Col1'] == df1['Col3']) & (df1['Col1'] == df1['Col4'])

#create mask for 'AAAA' in df2
mask2 = df2['Code'] == 'AAAA'

#create a temporary array to update df1 (repeat the single result once per masked row)
temp_arr = np.tile(df2.loc[mask2, 'Result'].to_numpy(), np.sum(mask1))

#add the array to the Result column
df1.loc[mask1, 'Result'] = temp_arr
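
For completeness, here is a minimal sketch of how that per-code update could be repeated for all 14 codes, continuing with the df1 and df2 from above. The masks dict and loop are illustrative only (just two codes are spelled out) and assume each code appears exactly once in df2:

#illustrative only: one hand-written mask per code, keyed by the code
masks = {
    'AAAA': (df1['Col1'] == df1['Col2']) & (df1['Col1'] == df1['Col3']) & (df1['Col1'] == df1['Col4']),
    'AABB': (df1['Col1'] == df1['Col2']) & (df1['Col1'] != df1['Col3']) & (df1['Col3'] == df1['Col4']),
    # ...one entry for each of the remaining 12 codes
}

#look up each code's single pre-calculated result and broadcast it
#across the matching rows of df1
for code, mask in masks.items():
    df1.loc[mask, 'Result'] = df2.loc[df2['Code'] == code, 'Result'].iloc[0]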

If there's a more efficient way to do this comparison, please let me know. Thanks.

As soon as I can answer, I'll post my solution using @Hwei Geok Ng's suggestion, but unless someone has something more efficient, I'll leave that suggestion as the answer.

Edit: I apologize; I had to correct the output table. I made a mistake while editing: the table originally showed the letter codes instead of the data that was supposed to be there. It is correct now.

Dan

1 Answer


Here's an idea:

  1. Find out how many unique values appear across the four columns of df1. For example, you might get 5 unique values 'A', 'B', 'C', 'D', and 'E' across all columns.
  2. Make a new column in df1 and assign the 'Code' for the four columns. For example, df1.loc[0, 'Code'] = 'AAAA'.
  3. Join the 'Code' column in df1 with the 'Code' column in df2. You'll get the 'Result' column in df1 (a sketch of these steps follows below).
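
A minimal sketch of those three steps, continuing with the df1 and df2 from the question. The row_to_code helper is illustrative (not part of the original answer) and assumes the code letters are assigned in order of first appearance within each row:

#map one row's four strings to a letter code: 'A' for the first
#distinct value encountered, 'B' for the second, and so on,
#e.g. ['Sam', 'Eric', 'Sam', 'Sam'] -> 'ABAA'
def row_to_code(row):
    letters = {}
    code = []
    for value in row:
        if value not in letters:
            letters[value] = chr(ord('A') + len(letters))
        code.append(letters[value])
    return ''.join(code)

#step 2: build the 'Code' column from the four string columns of df1
df1['Code'] = df1[['Col1', 'Col2', 'Col3', 'Col4']].apply(row_to_code, axis=1)

#step 3: join against the pre-calculated results in df2 (a left join
#keeps any df1 rows whose code has no entry in df2, with Result = NaN)
df1 = df1.merge(df2, on='Code', how='left')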
  • This may be it. I have to think on it. All the possible combinations are those 14. I just have to figure out how to populate a code column... I'll get back to you. – Dan Sep 23 '20 at 17:43
  • Looks like this is the ticket. Thanks for the help! – Dan Sep 23 '20 at 17:47
  • Someone else answered and then deleted. I'd like to see if it's faster... – Dan Sep 23 '20 at 17:50