I have two different excel files which I read using pd.readExcel
. The first excel file is kind of a master file which has a lot of columns. showing only those columns which are relevant:
df1
Company Name Excel Company ID
0 cleverbridge AG IQ109133656
1 BT España, Compañía de Servicios Globales de T... IQ3806173
2 Technoserv Group IQ40333012
3 Blue Media S.A. IQ50008102
4 zeb.rolfes.schierenbeck.associates gmbh IQ30413992
and the second excel is basically an output excel file which looks like this: df2
company_id found_keywords no_of_url company_name
0 IQ137156215 insurance 15 Zühlke Technology Group AG
1 IQ3806173 insurance 15 BT España, Compañía de Servicios Globales de T...
2 IQ40333012 insurance 4 Technoserv Group
3 IQ51614192 insurance 15 Octo Telematics S.p.A.
I want this output excel file/ df2 to include those company_id and company name from df1 where company id and company name from df1 is not a part of df2. Something like this: df2
company_id found_keywords no_of_url company_name
0 IQ137156215 insurance 15 Zühlke Technology Group AG
1 IQ3806173 insurance 15 BT España, Compañía de Servicios Globales de T...
2 IQ40333012 insurance 4 Technoserv Group
3 IQ51614192 insurance 15 Octo Telematics S.p.A.
4 IQ30413992 NaN NaN zeb.rolfes.schierenbeck.associates gmbh
I tried several ways of achieveing this by using pd.merge
as well as np.where
I even tried reindexing based on columns but nothing worked out. What exactly do I need to do so that it works as expected. Please help me out.Thanks!
EDIT:
using pd.merge
df2.merge(df, right_on='company_id', left_on='Excel Company ID', how='outer')
which gave an output with [220 rows X 31 columns]