How do I merge column values from one dataframe to another if they are not present in another using pandas

Question

I have two different Excel files, which I read using pd.read_excel. The first is a kind of master file with a lot of columns; I am showing only the relevant ones: df1

Company Name                                              Excel Company ID
0                                    cleverbridge AG      IQ109133656
1  BT España, Compañía de Servicios Globales de T...        IQ3806173
2                                   Technoserv Group       IQ40333012
3                                    Blue Media S.A.       IQ50008102
4            zeb.rolfes.schierenbeck.associates gmbh       IQ30413992

The second Excel file is an output file, which looks like this: df2

company_id          found_keywords  no_of_url                                       company_name
0  IQ137156215      insurance         15                         Zühlke Technology Group AG
1    IQ3806173      insurance         15  BT España, Compañía de Servicios Globales de T...
2   IQ40333012      insurance          4                                   Technoserv Group
3   IQ51614192      insurance         15                             Octo Telematics S.p.A.

I want this output file (df2) to also include the company ID and company name from df1 wherever they are not already part of df2. Something like this: df2

company_id found_keywords  no_of_url                                       company_name
0  IQ137156215      insurance         15                         Zühlke Technology Group AG
1    IQ3806173      insurance         15  BT España, Compañía de Servicios Globales de T...
2   IQ40333012      insurance          4                                   Technoserv Group
3   IQ51614192      insurance         15                             Octo Telematics S.p.A.
4   IQ30413992      NaN               NaN              zeb.rolfes.schierenbeck.associates gmbh

I tried several ways of achieving this, using pd.merge as well as np.where, and I even tried reindexing based on columns, but nothing worked. What exactly do I need to do to get the expected result? Please help me out. Thanks!

EDIT:

using pd.merge

df2.merge(df, right_on='company_id', left_on='Excel Company ID', how='outer')

which gave an output with [220 rows X 31 columns]

Using pd.merge, the shape of the output is 220 rows x 31 columns, which is incorrect. I'll update the question with the code I used. — technophile_3, May 25 '22 at 05:35
Now for your sample (df1 and df2), what is the expected output? — Corralien, May 25 '22 at 06:22
@Corralien the expected output is the same as the one shown in the question above — technophile_3, May 25 '22 at 06:25
Why do "cleverbridge AG" and "Blue Media S.A." not appear in the output? How do they differ from "BT España, Compañía de Servicios Globales de T..."? — Corralien, May 25 '22 at 06:41

score 1 · Answer 1 · answered May 25 '22 at 05:37

Your expected output is unclear. If you use pd.merge with how='outer' and indicator=True, you will have:

df1 = df1.rename(columns={'Company Name': 'company_name', 'Excel Company ID': 'company_id'})
out = df2.merge(df1, on=['company_id', 'company_name'], how='outer', indicator=True)

Output:

>>> out
    company_id found_keywords  no_of_url                                       company_name      _merge
0  IQ137156215      insurance       15.0                         Zühlke Technology Group AG   left_only
1    IQ3806173      insurance       15.0  BT España, Compañía de Servicios Globales de T...        both
2   IQ40333012      insurance        4.0                                   Technoserv Group        both
3   IQ51614192      insurance       15.0                             Octo Telematics S.p.A.   left_only
4  IQ109133656            NaN        NaN                                    cleverbridge AG  right_only
5   IQ50008102            NaN        NaN                                    Blue Media S.A.  right_only
6   IQ30413992            NaN        NaN            zeb.rolfes.schierenbeck.associates gmbh  right_only

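Check the last column, _merge. Rows marked right_only exist only in df1, which means their company_id and company_name pair was not found in df2. Rows marked left_only exist only in df2, and rows marked both were matched in the two frames.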
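As a follow-up, here is a minimal sketch of how the _merge column could be used to build the final frame. It rests on a few assumptions that are not confirmed in the question: the sample frames are rebuilt by hand from the printed tables, only the two key columns of the master file are kept (which may be why the original attempt ended up with 31 columns), and every right_only row should be appended. The names keys, missing and result are illustrative.

```python
import pandas as pd

# Sample frames rebuilt by hand from the tables in the question
# (company names truncated with "..." are kept exactly as printed).
df1 = pd.DataFrame({
    "Company Name": [
        "cleverbridge AG",
        "BT España, Compañía de Servicios Globales de T...",
        "Technoserv Group",
        "Blue Media S.A.",
        "zeb.rolfes.schierenbeck.associates gmbh",
    ],
    "Excel Company ID": [
        "IQ109133656", "IQ3806173", "IQ40333012", "IQ50008102", "IQ30413992",
    ],
})
df2 = pd.DataFrame({
    "company_id": ["IQ137156215", "IQ3806173", "IQ40333012", "IQ51614192"],
    "found_keywords": ["insurance"] * 4,
    "no_of_url": [15, 15, 4, 15],
    "company_name": [
        "Zühlke Technology Group AG",
        "BT España, Compañía de Servicios Globales de T...",
        "Technoserv Group",
        "Octo Telematics S.p.A.",
    ],
})

# Assumption: only the two key columns of the master file are needed,
# so the other master columns are left out before merging.
keys = df1[["Excel Company ID", "Company Name"]].rename(
    columns={"Excel Company ID": "company_id", "Company Name": "company_name"}
)

out = df2.merge(keys, on=["company_id", "company_name"], how="outer", indicator=True)

# Rows that exist only in df1, i.e. the ones missing from df2
missing = out.loc[out["_merge"] == "right_only", ["company_id", "company_name"]]

# Final frame: df2's columns, with NaN where df1-only rows have no data
result = out.drop(columns="_merge")
print(result)
```

With the sample data this gives seven rows, three of which come only from df1. If only some of those rows are wanted, as in the expected output in the question, missing can be filtered further before it is combined with df2.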

let me test this solution — technophile_3, May 25 '22 at 05:39
yes..I am getting right_only in the last column _merge — technophile_3, May 25 '22 at 05:53
