0

I have two dataframes with similar data. I want to merge them to combine all the information into one dataframe. The problem is, I would like to prioritize data from one dataframe if there are conflicts between merge of multiple columns (df1 in the example). And also I want to do it on multiple columns if ANY of the chosen columns match.

I apologize if my explanation is not clear enough. If there is any other information I should provide please let me know.

The way I do it now. This works fine if I would choose only one column but I can't figure out how to do it on multiple.

merge_by = ['id', 'name1', 'name2']
a = df1.merge(df2, how='outer', on=merge_by)

So how I would imagine this to work is
df1.merge(df2, how='outer', on='id' or 'name1' or 'name2')
df1= DataFrame([
    [0, 'john', 'bon', 'ron'],
    [1, 'alex', 'dale', 'bruce'],
    [2, 'joey', 'bill', 'maci'],
    [3, 'choi', 'nath', 'karl'],
    [4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])


id   name1   name2   name3
0    'john'   'bon'   'ron'
1    'alex'   'dale'  'bruce'
2    'joey'   'bill'  'maci'
3    'choi'   'nath'  'karl'
4    'walt'   ''      'xander'
df2= DataFrame([
    [0, 'emil', 'tia', 'bia'],
    [4, '', 'sara', 'carmen'],
    [5, 'aden', 'dale', 'leia'],
    [6, 'joey', 'jax', 'jace'],
    [7, 'choi', 'nath', 'andre'],
    [8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])


id   name1   name2   name3
0    'emil'   'tia'   'bia'
4    ''       'sara'  'carmen'
5    'aden'   'dale'  'leia'
6    'joey'   'jax'   'jace'
7    'choi'   'nath'  'andre'
8    ''       ''      'piper'

The output I would want

id   name1   name2   name3_x name3_y
0    'john'   'bon'   'ron'   'bia'
1    'alex'   'dale'  'bruce' 'leia'
2    'joey'   'bill'  'maci'  'jace'
3    'choi'   'nath'  'karl'  'andre'
4    'walt'   'sara'  'xander' 'carmen'
8    ''       ''      ''      'piper'

Edit** Code taken from the answer here as suggested in the comments below.

df1= pd.DataFrame([
    [0, 'john', 'bon', 'ron'],
    [1, 'alex', 'dale', 'bruce'],
    [2, 'joey', 'bill', 'maci'],
    [3, 'choi', 'nath', 'karl'],
    [4, 'walt', '', 'xander'],
], columns=['id','name1','name2','name3'])

df2= pd.DataFrame([
    [0, 'emil', 'tia', 'bia'],
    [4, '', 'sara', 'carmen'],
    [5, 'aden', 'dale', 'leia'],
    [6, 'joey', 'jax', 'jace'],
    [7, 'choi', 'nath', 'andre'],
    [8, '', '', 'piper'],
], columns=['id','name1','name2','name3'])

suff_A = ['_on_A_match_1', '_on_A_match_2']
suff_B = ['_on_B_match_1', '_on_B_match_2']
suff_C = ['_on_C_match_1', '_on_C_match_2']

df = pd.concat([df1.merge(df2[df2['id'] != ''], on='id', suffixes=suff_A), 
                df1.merge(df2[df2['name1'] != ''], on='name1', suffixes=suff_B),
                df1.merge(df2[df2['name2'] != ''], on='name2', suffixes=suff_C)])

dups = (df.id_on_B_match_1 == df.id_on_B_match_2) # also could remove A_on_B_match
a = df.loc[~dups]
print(df)

enter image description here

The problem with this one is that the id 3 is repeated, I am not sure how to set up dups with more than 2 columns. And also how could I format the final output to be only the answers that I want?

  • Hi! Does the approach to this [question](https://stackoverflow.com/questions/45869886/python-pandas-how-to-merge-based-on-an-or-condition) answer yours? – Simon David May 11 '23 at 04:23
  • Maybe you could make a loop and run it that many times as as the number of columns to be merged? – Xcape9797 May 11 '23 at 05:19
  • The comment from @SimonGsponer seems to be something that could help me but I have trouble implementing it to get an answer in a format I would like. Also the output gives one more extra field (to the example dataframes I used here). – IonicEcommerce May 11 '23 at 05:36
  • then you used merge function, it is logical function, however i do not find you last panel logic. can you show which columns and how do you want? – Tornike Kharitonishvili May 11 '23 at 05:44
  • I am not exactly sure what you mean but I have edited the post and added the code from the post that @SimonGsponer linked. – IonicEcommerce May 11 '23 at 05:58

1 Answers1

1
columns = "id", "name1", "name2"

df = pd.concat(
   df1.merge(df2.dropna(subset=column), on=column, suffixes=["", "_y"])
   for column in columns
).drop_duplicates("id")

ids = set(df["id"].dropna()).union(df["id_y"].dropna())

pd.concat([
   df,
   df1[~df1["id"].isin(ids)],
   df2[~df2["id"].isin(ids)]
])
   id name1 name2   name3 name1_y name2_y name3_y  id_y
0   0  john   bon     ron    emil     tia     bia   NaN
1   4  walt   NaN  xander     NaN    sara  carmen   NaN
0   2  joey  bill    maci     NaN     jax    jace   6.0
1   3  choi  nath    karl     NaN    nath   andre   7.0
0   1  alex  dale   bruce    aden     NaN    leia   5.0
5   8   NaN   NaN   piper     NaN     NaN     NaN   NaN
jqurious
  • 9,953
  • 1
  • 4
  • 14