Lets say we have two dataframe called df
which is the data frame that we want modification and a
reference data
frame as the data frame should look like;
but df
is sometimes partially missing some rows from the reference
dataframe
import pandas as pd
df = pd.DataFrame({'gr1f': ['A','A','B','A','B','A','B'],
'gr2p': ['CC','CC','CC','CC','CC','CC','CC'],
'gr3a': ['AL','AL','AL','DEL','DEL','DEL','DEL'],
'DP': [ 'l1yrs', 'l10yrs','l10yrs','l1yrs','l1yrs','l10yrs','l10yrs'],
'num_vals1' :[
1.138044999,
1.221786568,
1.7,
1.159030763,
1.2,
1.5,
1.238341765 ],
'num_vals2' :[
0.166637328,
0.04513741,
0.12,
0.141340498,
0.87,
0.033441602,
0.01233441602]})
df
reference_df = pd.DataFrame({
'gr2p': ['CC','CC','CC','CC','DD','DD','DD','DD'],
'gr3a': ['AL','AL','AL','AL','DEL','DEL','DEL','DEL'],
'DP': ['l1yrs', 'l1yrs', 'l10yrs','l10yrs','l1yrs','l1yrs','l10yrs','l10yrs']})
reference_df
to merge this two data frame I used right join Pandas Merging 101
df1 = df.merge(reference_df[['gr2p','gr3a','DP']], on = ['gr2p','gr3a','DP'], how='right').sort_values(by =['gr1f','gr2p','DP']).reset_index(drop=True)
df1
which yields
The expected output that I need to have
gr1f gr2p gr3a DP num_vals1 num_vals2 source
0 A CC AL l10yrs 1.221787 0.045137 df
1 B CC AL l10yrs 1.700000 0.120000 df
2 A CC AL l1yrs 1.138045 0.166637 df
3 B CC AL l1yrs NaN NaN reference_df
4 A CC DEL l10yrs 1.50000 0.033441602 df
5 B CC DEL l10yrs 1.238342 0.012334 df
6 A CC DEL l1yrs 1.15903 0.1413404 df
7 B CC DEL l1yrs 1.20000 0.87 df
8 A DD DEL l10yrs NaN NaN reference_df
9 B DD DEL l10yrs NaN NaN reference_df
10 A DD DEL l1yrs NaN NaN reference_df
11 B DD DEL l1yrs NaN NaN reference_df
how can we achieve this final table with filling missing rows gr1f
with A
and B
combinations and DP
with 'l10yrs, l1yrs' combination. I think the expected output is showing the pattern so its a good guide.