I need to compare 2 fields of 2 different dataframe if it matches we need to fill details else fill with null values in python

Question

I have 2 csv files df1

x   y  z      m
a   b  c  [apple,iphone,watch,newdevice]
e   w  q   NaN
w   r  t  [pixel,google]
s   t  q  [india,computer]

df2

new      code    file
apple    appl    ofo
lg       weew    ofe
htc      rrr     ofr
google   ggle    ofg

now i need to check m values in df1 with new value in df2 if it matches i need to combine the details of new values to df1 else we need to fill with null values I need to use python please help me

sample output

x   y  z      m                                code     file
a   b  c  [apple,iphone,watch,newdevice]       aapl     ofo
e   w  q   NaN                                 null     null
w   r  t  [pixel,google,]                      ggle     ofg
s   t  q  [india,computer]                     null     null

is the merge column always going to be the first item in the list? It simplifies it a lot if that's the case — Sven Harris, Oct 04 '18 at 17:27
@SvenH. no that may not be a case it may change ex:- [iphone , apple] or [iphone , watch , apple] or [pixel,google,pixelbuds] — Rahul Varma, Oct 04 '18 at 17:30
Is it possible for a list to map to multiple lines in `df2` like `['apple', 'google']`? — ALollz, Oct 04 '18 at 17:37

jpp · Answer 1 · 2018-10-04T17:45:41.757

1

Here is a NumPy-based approach via np.isin which tests each value in an 2d-array against a 1d-array. But really this should be considered a last resort: lists in series are inefficient and you will face performance issues for large datasets.

Note argmax will check only for the first match if multiple matches exist in a list.

import pandas as pd, numpy as np

df1 = pd.DataFrame({'x': list('aws'), 'y': list('brt'), 'z': list('ctq'),
                    'm': [['apple', 'iphone', 'watch', 'newdevice'],
                          ['google', 'pixel'], ['india', 'computer']]})

split = pd.DataFrame(df1['m'].values.tolist()).values
mask = np.isin(split, df2['new'].values).argmax(1)
df1['new'] = split[np.arange(split.shape[0]), mask]

df = pd.merge(df1, df2, on='new', how='left').drop('new', 1)

print(df)

   x  y  z                                  m  code file
0  a  b  c  [apple, iphone, watch, newdevice]  appl  ofo
1  w  r  t                    [google, pixel]  ggle  ofg
2  s  t  q                  [india, computer]   NaN  NaN

edited Oct 04 '18 at 17:45

answered Oct 04 '18 at 17:30

jpp

159,742
34
281
339

Actually I have a very big dataset, I cannot insert values manually as you have done. df1 = pd.DataFrame({'x': list('aws'), 'y': list('brt'), 'z': list('ctq'), 'm': [['apple', 'iphone', 'watch', 'newdevice'], ['google', 'pixel'], ['india', 'computer']]}) – Rahul Varma Oct 04 '18 at 17:51
@RahulVarma, Then you shouldn't use lists in series! Pandas likely isn't the right data structure for your data. – jpp Oct 04 '18 at 17:51
but data is in the same format as I have shown above . – Rahul Varma Oct 04 '18 at 17:55
@RahulVarma, I've created *precisely* what you have displayed in your question. Why not update your question with the definition of your dataframe? See [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) if you're stuck. – jpp Oct 04 '18 at 17:56

I need to compare 2 fields of 2 different dataframe if it matches we need to fill details else fill with null values in python

1 Answers1