2

I have two dataframes df1 and df2. I would like to get the pred_label from df2 to get assigned to the corresponding bar_cdin df1. there are some duplicate bar_cd like 1006040448 in df1. I would not like to lose them due to the merge operation. I tried left merge but ended up getting null values for pred_label

Both df1 and df2 have similar bar_cd.

If I would like to assign the pred_label to df1 which approach would be suitable. I would appreciate your advice. Is there a way to map pred_label to bar_cd in df1 I have referred to the article [Merge 101][1]. But it did not work for me.

df1:

     bar_cd     actual_label
0   1006036382  3.0
1   1006040448  3.0
2   1006040448  3.0
3   1006044789  3.0
4   1006044789  3.0

df2: (consider this as master file for mapping)

         bar_cd     actual_label  pred_label
133494  1006036382  3.00           3.00
180288  1006040448  3.00           4.00
122732  1006044789  3.00           4.00
38225   1006808018  1.00           2.00
205799  1008874962  2.00           1.00




expected result:

     bar_cd     actual_label      pred_lable 
0   1006036382  3.0               3.0
1   1006040448  3.0               4.0
2   1006040448  3.0               4.0
3   1006044789  3.0               4.0
4   1006044789  3.0               4.0
... ... ...


  [1]: https://stackoverflow.com/questions/53645882/pandas-merging-101
Sam
  • 352
  • 2
  • 4
  • 22

2 Answers2

1

Try creating a dictionary with your bar_cd as as your keys and pred_label as your values from df2, and then map it back to df1.

d = dict(zip(df2.bar_cd,df2.pred_label))
df1['pred_label'] = df1['bar_cd'].map(d)

df1
       bar_cd  actual_label  pred_label
0  1006036382             3           3
1  1006040448             3           4
2  1006040448             3           4
3  1006044789             3           4
4  1006044789             3           4
sophocles
  • 13,593
  • 3
  • 14
  • 33
  • Is creating a dictionary efficient if I have 300k unique `bar_cd` ? My question should have been more precise regarding the unique number of `bar_cd` – Sam Mar 26 '21 at 09:14
  • all the values for `pred_label` are coming out to be `Nan`. Is there anything missing? – Sam Mar 26 '21 at 09:35
  • Sorry for the delays in replying. The code works well for me. Are you sure the bar_cd on the 2 datasets are the same? – sophocles Mar 26 '21 at 09:38
  • It seems so. The number of unique values in both dataframes is same. Is there a way to verify if both dataframes have same `bar_cd` ? – Sam Mar 26 '21 at 09:55
  • I tried `df1.bar_cd.unique() == df2.bar_cd.unique()` it shows me a lot of false values. But I think my approach of checking this condition is wrong. – Sam Mar 26 '21 at 09:58
  • check for common ones using a list comprehension -> ```[i for i in df1['bar_cd'].unique() if i in df2['bar_cd'].unique()]```. This will give you a list of the common ones. – sophocles Mar 26 '21 at 10:16
0

Try:

# Separate duplicate values
dup = df1.duplicated(subset=['bar_cd'], keep='last')
dup = dup.sort_values(by=["bar_cd])

rem = df1[~dup]
rem = rem.sort_values(by=["bar_cd])


# Reorder df2
df2 = df2.sort_values(by=["bar_cd])

# Merge each df with df2
new_df1 = pd.merge(
    right=rem,
    left=df2,
    how="right",
    on=["bar_cd"],
)

new_df2 = pd.merge(
    right=dup,
    left=df2,
    how="right",
    on=["bar_cd"],
)

new_df = pd.concat([new_df1, new_df2])
Laurent
  • 12,287
  • 7
  • 21
  • 37