how to join 2 table with "conditional" full join

Question

Background

I am a Machine Learning Engineer and want to analyze model performance. My model should recall serveral items in each request_id(For example, a pageview == request_id).

In each scenario(For instance, pageview in different countries), model should recall something particularly what I want(taking this as groundtruth). However, model result is not ideal at all(taking this as prediction). I want to compare model prediction with the groundtruth.

In one word, Table A misses something and I want join Table B to Table A to make it full on some condition. I can hardly tell what "SQL join method" can be used here, so I use "conditional full join" in the title at the present, e.g. for each request_id and country, full join prediction and groundtruth together.

Data

This is model prediction for 3 request_id:

import pandas as pd
import numpy as np
df1 = pd.DataFrame({'country': ["UK","UK","UK","UK","JP","JP","JP","JP"],
                    'request_id': [0,0,0,0,1,1,2,2],
                    'item_predict': ['item0', 'item3','item4','item5' ,'item0', 'item1','item0', 'item1']})
df1

Output:

    country request_id  item_predict
0   UK  0   item0
1   UK  0   item3
2   UK  0   item4
3   UK  0   item5
4   JP  1   item0
5   JP  1   item1
6   JP  2   item0
7   JP  2   item1

This is the ideal items for each country:

df2 = pd.DataFrame({'country': ["UK","UK","UK", "JP", "JP"], 'item_ground_truth': ['item0', 'item1', 'item2', 'item0', 'item2']})
df2

Output:

    country item_ground_truth
0   UK  item0
1   UK  item1
2   UK  item2
3   JP  item0
4   JP  item2

What I want to know is: for each request_id, what items are missing in model prediction?

# this is what I want. I want to know the missing item for each request_id.
df_wanted = pd.DataFrame({'country': ["UK","UK","UK","UK", "UK", "UK","JP","JP","JP","JP","JP","JP"],
                    'request_id': [0,0,0,0,0,0,1,1,1,2,2,2],
                    'item_predict': ['item0', 'item3','item4','item5',np.nan ,np.nan,'item0', 'item1', np.nan, 'item0', 'item1', np.nan],
                    'item_ground_truth': ['item0', np.nan, np.nan,np.nan,'item1','item2' ,'item0', 'item1', 'item2', 'item0', 'item1', 'item2']})

Output:

    country request_id  item_predict    item_ground_truth
0   UK  0   item0   item0
1   UK  0   item3   NaN
2   UK  0   item4   NaN
3   UK  0   item5   NaN
4   UK  0   NaN item1   -> for request_id ==0, it lost item1
5   UK  0   NaN item2   -> for request_id ==0, it lost item2
6   JP  1   item0   item0
7   JP  1   item1   item1
8   JP  1   NaN item2   -> for request_id ==1, it lost item2
9   JP  2   item0   item0
10  JP  2   item1   item1
11  JP  2   NaN item2   -> for request_id ==2, it lost item2

My attempt:

# this is not right. 
# Reason 1: We lost request_id for some records. I don't know which request_id lost item2.
# Reason 2: The number of records is reduced. For request_id == 0, model did not recall item1 and item2. For request_id == 1/2, model did not recall item2, so I need 4 records instead 3 of them.
df1.merge(df2, how = 'outer', left_on= ['country', 'item_predict'], right_on= ['country', 'item_ground_truth'])

Output:

    country request_id  item_predict    item_ground_truth
0   UK  0.0 item0   item0
1   UK  0.0 item3   NaN
2   UK  0.0 item4   NaN
3   UK  0.0 item5   NaN
4   JP  1.0 item0   item0
5   JP  2.0 item0   item0
6   JP  1.0 item1   NaN
7   JP  2.0 item1   NaN
8   UK  NaN NaN item1 --> for request_id ==0, it lost item 1
9   UK  NaN NaN item2 --> for request_id ==0, it lost item 2
10  JP  NaN NaN item2 --> for request_id ==1/2, they both lost item 2. But there is only 1 record here

I know in SQL server you can use nested queries with "Where not exists" to get the result you want. I found this post that relates to your framework: https://stackoverflow.com/questions/59728078/how-to-select-records-with-not-exists-condition-in-pandas-dataframe This is how it would look like in SQL Server: `SELECT column_name FROM Table_Name WHERE NOT EXISTS (SELECT column_name FROM Table_Name WHERE condition);` — Swedo, Aug 18 '22 at 10:51
Please: Putting words in scare quotes does not clarify the idiosyncratic meaning that you don't make clear by actually saying what you mean. Use enough words, sentences & references to parts of examples to clearly & fully say what you mean. When giving a business relation(ship)/association or table (base or query result), say what a row in it states about the business situation in terms of its column values. — philipxy, Sep 05 '22 at 14:21
Debug questions require a [mre]--cut & paste & runnable code including initialization; desired & actual output (including verbatim error messages); tags & versions; clear specification & explanation. For SQL include DDL & tabular initialization code. For debug that includes the least code you can give that is code that you show is OK extended by code that you show is not OK. [ask] [Help] When you get a result you don't expect, pause your overall goal, chop to the 1st subexpression with unexpected result & say what you expected & why, justified by documentation. (Debugging fundamental.) — philipxy, Sep 05 '22 at 14:22
@philipxy Thank you, I put my narratives in the Background part and all code in Data. However, I still dont make myself clear, sry for inconvenience. — Travis, Sep 07 '22 at 06:29
Please either ask about 1 bad query/function with the obligatory [mre] & why you think it should return something else at the 1st subexpression that it doesn't give what you expect, justified by reference to authoritative documentation, or ask about your overall goal giving working parts you can do & ideally a [mre]. But please ask about the former 1st because misconceptions in the former will get in the way of understanding the latter. And bad code doesn't tell us what you wish it would do. PS Please review & act on all the rest of the comments. — philipxy, Sep 07 '22 at 09:38
[Is there any rule of thumb to construct SQL query from a human-readable description?](https://stackoverflow.com/a/33952141/3404097) — philipxy, Sep 07 '22 at 09:41
I think I specify my abstract goal in the third paragraph of background subtitle and a concrete way in `What I want to know is: for each request_id, what items are missing in model prediction?` in data subtitle. As for reproducible examples, I provide `df1` and `df2` as input and emphasize `df_wanted` as the output I need. The code itself is reproducible....What else can I do? Can you help me to revise my question? — Travis, Sep 18 '22 at 08:51

Gonçalo Peres · Answer 1 · 2022-09-05T14:30:15.297

After carefully reading the question, ended up with two potential interpretations of what OP wants:

A) Given df_wanted, detect the elements where item_predict is null.

B) The goal is to generate a df_wanted by merging df1 and df2, guaranteeing that request_id are not removed.

Interpretation A

Considering the dataframe df_wanted that OP mentions in the question

import pandas as pd
import numpy as np

df_wanted = pd.DataFrame({'country': ["UK","UK","UK","UK", "UK", "UK","JP","JP","JP","JP","JP","JP"],
                    'request_id': [0,0,0,0,0,0,1,1,1,2,2,2],
                    'item_predict': ['item0', 'item3','item4','item5',np.nan ,np.nan,'item0', 'item1', np.nan, 'item0', 'item1', np.nan],
                    'item_ground_truth': ['item0', np.nan, np.nan,np.nan,'item1','item2' ,'item0', 'item1', 'item2', 'item0', 'item1', 'item2']})

Assuming one wants to know, from df_wanted, the request_id, country, and item_ground_truth for the rows where item_predict is null, one can create a separate dataframe df and use DataFrame.isnull as follows

df = df_wanted[df_wanted['item_predict'].isnull()]    

[Out]:
   country  request_id item_predict item_ground_truth
4       UK           0          NaN             item1
5       UK           0          NaN             item2
8       JP           1          NaN             item2
11      JP           2          NaN             item2

Interpretation B

For that one will need to clarify what one really wants. I recommend OP to go through this thread: Pandas Merging 101

In pandas.DataFrame.merge, the argument on is a list of columns to join on, the how specifies the type of join. An outer join keeps all rows from both df1 and df2, even if they don't match on the join columns, such as

df_wanted = df1.merge(df2, on=['country'], how='left')

print(df_wanted.head())

[Out]:
   country  request_id item_predict item_ground_truth
0       UK           0        item0             item0
1       UK           0        item0             item1
2       UK           0        item0             item2
3       UK           0        item3             item0
4       UK           0        item3             item1
5       UK           0        item3             item2
6       UK           0        item4             item0
7       UK           0        item4             item1
8       UK           0        item4             item2
9       UK           0        item5             item0
10      UK           0        item5             item1
11      UK           0        item5             item2
12      JP           1        item0             item0
13      JP           1        item0             item2
14      JP           1        item1             item0
15      JP           1        item1             item2
16      JP           2        item0             item0
17      JP           2        item0             item2
18      JP           2        item1             item0
19      JP           2        item1             item2

Thank you bro, what I want is (B). I need to generate `df_wanted` based on `df_1`(prediction) and `df2`(groundtruth) — Travis, Sep 07 '22 at 06:30
@Travis can you specify the requirements to consider for that merge (what do we want from `df1` and what do we want from `df2`)? The desired output, and how does that differ from what is already in the answer, will also be helpful. — Gonçalo Peres, Sep 07 '22 at 07:29
`df1` is model predtion, `df2` is the ground truth. Taking `request_id = 0` as example, model think we should provide item0/3/4/5 to UK. However, the truth is we need to provide item0/1/2 to UK. I want to know the missing item for model, e.g. item 1/2. The desired output should contain all information about `df1` and `df2` and give me ①the union of `df1` and `df2` ②what is missing in `df1` — Travis, Sep 18 '22 at 08:59

how to join 2 table with "conditional" full join

Background

Data

My attempt:

1 Answers1