Background
I am a Machine Learning Engineer and want to analyze model performance. My model should recall serveral items in each request_id(For example, a pageview == request_id).
In each scenario(For instance, pageview in different countries), model should recall something particularly what I want(taking this as groundtruth). However, model result is not ideal at all(taking this as prediction). I want to compare model prediction with the groundtruth.
In one word, Table A
misses something and I want join Table B
to Table A
to make it full on some condition. I can hardly tell what "SQL join method" can be used here, so I use "conditional full join" in the title at the present, e.g. for each request_id and country, full join prediction and groundtruth together.
Data
This is model prediction for 3 request_id:
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'country': ["UK","UK","UK","UK","JP","JP","JP","JP"],
'request_id': [0,0,0,0,1,1,2,2],
'item_predict': ['item0', 'item3','item4','item5' ,'item0', 'item1','item0', 'item1']})
df1
Output:
country request_id item_predict
0 UK 0 item0
1 UK 0 item3
2 UK 0 item4
3 UK 0 item5
4 JP 1 item0
5 JP 1 item1
6 JP 2 item0
7 JP 2 item1
This is the ideal items for each country:
df2 = pd.DataFrame({'country': ["UK","UK","UK", "JP", "JP"], 'item_ground_truth': ['item0', 'item1', 'item2', 'item0', 'item2']})
df2
Output:
country item_ground_truth
0 UK item0
1 UK item1
2 UK item2
3 JP item0
4 JP item2
What I want to know is: for each request_id, what items are missing in model prediction?
# this is what I want. I want to know the missing item for each request_id.
df_wanted = pd.DataFrame({'country': ["UK","UK","UK","UK", "UK", "UK","JP","JP","JP","JP","JP","JP"],
'request_id': [0,0,0,0,0,0,1,1,1,2,2,2],
'item_predict': ['item0', 'item3','item4','item5',np.nan ,np.nan,'item0', 'item1', np.nan, 'item0', 'item1', np.nan],
'item_ground_truth': ['item0', np.nan, np.nan,np.nan,'item1','item2' ,'item0', 'item1', 'item2', 'item0', 'item1', 'item2']})
Output:
country request_id item_predict item_ground_truth
0 UK 0 item0 item0
1 UK 0 item3 NaN
2 UK 0 item4 NaN
3 UK 0 item5 NaN
4 UK 0 NaN item1 -> for request_id ==0, it lost item1
5 UK 0 NaN item2 -> for request_id ==0, it lost item2
6 JP 1 item0 item0
7 JP 1 item1 item1
8 JP 1 NaN item2 -> for request_id ==1, it lost item2
9 JP 2 item0 item0
10 JP 2 item1 item1
11 JP 2 NaN item2 -> for request_id ==2, it lost item2
My attempt:
# this is not right.
# Reason 1: We lost request_id for some records. I don't know which request_id lost item2.
# Reason 2: The number of records is reduced. For request_id == 0, model did not recall item1 and item2. For request_id == 1/2, model did not recall item2, so I need 4 records instead 3 of them.
df1.merge(df2, how = 'outer', left_on= ['country', 'item_predict'], right_on= ['country', 'item_ground_truth'])
Output:
country request_id item_predict item_ground_truth
0 UK 0.0 item0 item0
1 UK 0.0 item3 NaN
2 UK 0.0 item4 NaN
3 UK 0.0 item5 NaN
4 JP 1.0 item0 item0
5 JP 2.0 item0 item0
6 JP 1.0 item1 NaN
7 JP 2.0 item1 NaN
8 UK NaN NaN item1 --> for request_id ==0, it lost item 1
9 UK NaN NaN item2 --> for request_id ==0, it lost item 2
10 JP NaN NaN item2 --> for request_id ==1/2, they both lost item 2. But there is only 1 record here