
I have two DataFrames. df1 contains thousands of rows with four columns of string data. The strings can be almost anything; what matters is which columns match. Here I just chose character names for illustration. Within one row of df1, the pattern of matches among the four columns corresponds to a code for 1 of 14 fairly complex equations. df2 contains the codes and pre-calculated results for those 14 equations. Instead of calculating each of the thousands of values on the fly, I want to compare the four columns in df1 and populate a Result column using the pre-calculated results from df2.

df1 looks like this:

     Col1    Col2    Col3    Col4
0    'Sam'   'Sam'   'Sam'   'Sam'  #<----this is 'AAAA'
1    'Sam'   'Sam'   'Eric'  'Eric' #<----this is 'AABB' 
2    'Sam'   'Eric'  'Sam'   'Sam'  #<----this is 'ABAA'
..   ....                          
n    'Sam'   'Eric'  'Piggy' 'Jack' #<----this is 'ABCD'

df2 looks like this:

      Code    Result
0    'AAAA'   1.0040
1    'ABAB'   7.7777
2    'AABA'   5.5400
3    'ABAA'   0.0400
4    'BAAA'   1.0250
5    'ABAB'   2.0000
6    'ABAC'   6.5300
7    'ABCA'   5.0090
8    'BAAC'   1.0008
9    'BACA'   1.0000
10   'AABB'   56.000
11   'AABC'   0.2500
12   'BCAA'   0.5000
13   'ABCD'   1.7777

Afterwards df1 should look like this:

     Col1    Col2    Col3    Col4    Result
0    'Sam'   'Sam'   'Sam'   'Sam'   1.0040
1    'Sam'   'Sam'   'Eric'  'Eric'  56.000
2    'Sam'   'Eric'  'Sam'   'Sam'   0.0400
..   ....                            .....
n    'Sam'   'Eric'  'Piggy' 'Jack'  1.7777

Because I am unaware of a way to do this all at once, I thought the quickest way to handle this would be to apply a mask to df1 14 times and update the masked data based on an np.array created by a mask from df2. So the first update ('AAAA') would look like this:

import numpy as np

#create mask for 'AAAA' in df1 (all four columns in the row are equal)
mask1 = (df1['Col1'] == df1['Col2']) & (df1['Col1'] == df1['Col3']) & (df1['Col1'] == df1['Col4'])

#create mask for 'AAAA' in df2
mask2 = df2['Code'] == 'AAAA'

#create a temporary array to update df1 (repeat the single result once per masked row)
temp_arr = np.tile(df2.loc[mask2, 'Result'].to_numpy(), np.sum(mask1))

#add the array to the Result column
df1.loc[mask1, 'Result'] = temp_arr
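
For completeness, here is a minimal sketch of how that per-code update could be repeated for all 14 codes, continuing with the df1 and df2 from above. The masks dict and loop are illustrative only (just two codes are spelled out) and assume each code appears exactly once in df2:

#illustrative only: one hand-written mask per code, keyed by the code
masks = {
    'AAAA': (df1['Col1'] == df1['Col2']) & (df1['Col1'] == df1['Col3']) & (df1['Col1'] == df1['Col4']),
    'AABB': (df1['Col1'] == df1['Col2']) & (df1['Col1'] != df1['Col3']) & (df1['Col3'] == df1['Col4']),
    # ...one entry for each of the remaining 12 codes
}

#look up each code's single pre-calculated result and broadcast it
#across the matching rows of df1
for code, mask in masks.items():
    df1.loc[mask, 'Result'] = df2.loc[df2['Code'] == code, 'Result'].iloc[0]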

If there's a more efficient way to do this comparison, please let me know. Thanks.

As soon as I can answer, I'll post my solution using @Hwei Geok Ng's suggestion, but unless someone has something more efficient, I'll leave that suggestion as the answer.

Edit: I apologize; I had to correct the output table. I made a mistake while editing: the table originally showed the letter codes instead of the data that was supposed to be there. It is correct now.

Dan

1 Answer


Here's an idea:

  1. Find out how many unique values appear across the four columns of df1. For example, you might get 5 unique values 'A', 'B', 'C', 'D', and 'E' across all columns.
  2. Make a new column in df1 and assign the 'Code' for the four columns. For example, df1.loc[0, 'Code'] = 'AAAA'.
  3. Join the 'Code' column in df1 with the 'Code' column in df2. You'll get the 'Result' column in df1 (a sketch of these steps follows below).
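
A minimal sketch of those three steps, continuing with the df1 and df2 from the question. The row_to_code helper is illustrative (not part of the original answer) and assumes the code letters are assigned in order of first appearance within each row:

#map one row's four strings to a letter code: 'A' for the first
#distinct value encountered, 'B' for the second, and so on,
#e.g. ['Sam', 'Eric', 'Sam', 'Sam'] -> 'ABAA'
def row_to_code(row):
    letters = {}
    code = []
    for value in row:
        if value not in letters:
            letters[value] = chr(ord('A') + len(letters))
        code.append(letters[value])
    return ''.join(code)

#step 2: build the 'Code' column from the four string columns of df1
df1['Code'] = df1[['Col1', 'Col2', 'Col3', 'Col4']].apply(row_to_code, axis=1)

#step 3: join against the pre-calculated results in df2 (a left join
#keeps any df1 rows whose code has no entry in df2, with Result = NaN)
df1 = df1.merge(df2, on='Code', how='left')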
  • This may be it. I have to think on it. All the possible combinations are those 14. I just have to figure out how to populate a code column... I'll get back to you. – Dan Sep 23 '20 at 17:43
  • Looks like this is the ticket. Thanks for the help! – Dan Sep 23 '20 at 17:47
  • Someone else answered and then deleted. I'd like to see if it's faster... – Dan Sep 23 '20 at 17:50