0

I want to merge two datafames (A and B) on a key say X, so that if there is a common column K (assume we dont know the name of this column) with same name in both dataframes then we should keep values where it is present

Example for row 1: if A has value in K (and not B) then keep A's value, but for row 2: if B has value in K (and not A) then keep that value. If present in both then keep anyone.

We dont know the column name. So pls don’t mark it as already solved with pd.merge and then apply with column names. See picture for explanation ( there might be 50 columns like Name here)

Like this

Sumit Singh
  • 129
  • 10
  • Some concrete example input and output would be helpful here – Code Different Sep 13 '19 at 14:00
  • Added a link for better example – Sumit Singh Sep 13 '19 at 14:35
  • you can use `pd.merge()` check its [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html) – ElSheikh Sep 13 '19 at 14:36
  • Use: pd.merge(df1, df2, all.x=True) but this is a duplicated question and have been answered before: https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right, and https://stackoverflow.com/questions/43499709/merge-dataframes-based-on-common-columns-but-keeping-all-rows-from-x – Mack123456 Sep 13 '19 at 14:52
  • Possible duplicate of [How to join (merge) data frames (inner, outer, left, right)](https://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right) – Mack123456 Sep 13 '19 at 14:56
  • @Mack123456 pd.merge(df1, df2, all.x=True) doesn't work, and the questions you posted refer to R code not Python code – Massifox Sep 14 '19 at 08:27
  • 1
    Agreed, however code by @Massifox solves the problem. – Sumit Singh Sep 16 '19 at 06:16

1 Answers1

4

Use pd.merge(). See this example:

df_a = pd.DataFrame({'subject_id': ['1', '2', '3', '4', '5'],
                     'first_name': ['Alex', 'Amy', 'Allen', 'Alice', None]} , 
                    columns = ['subject_id', 'first_name'])

df_b = pd.DataFrame({'subject_id': ['4', '5', '6', '7', '8'],
                     'first_name': [None, 'Brian', 'Bran', 'Bryce', 'Betty']}, 
                    columns = ['subject_id', 'first_name'])

Using pd.merge() do the outer join on 'subject_id':

df_merge = pd.merge(df_a, df_b, on='subject_id', how='outer')

and you get:

subject_id first_name_x first_name_y
0          1         Alex          NaN
1          2          Amy          NaN
2          3        Allen          NaN
3          4        Alice         None
4          5         None        Brian
5          6          NaN         Bran
6          7          NaN        Bryce
7          8          NaN        Betty

You can use where to conditionally assign a value from the _x and _y columns:

df_merge['first_name'] = df_merge['first_name_x'].where(df_merge['first_name_x'].notnull(), df_merge['first_name_y'])

Using .drop() method to delete _x and _y columns:

df_res = df_merge.drop(['first_name_x', 'first_name_y'], axis=1)

And get the result you are looking for:

print(df_res)
  subject_id first_name
0          1       Alex
1          2        Amy
2          3      Allen
3          4      Alice
4          5      Brian
5          6       Bran
6          7      Bryce
7          8      Betty

If you don't know the name of the columns that will be generated from the merge, you can use this code:

suffixes = ('__x', '__y')
df_merge = pd.merge(df_a, df_b, on='subject_id', how='outer', suffixes=suffixes)

drop_cols = []
for col in df_merge.columns:
    if col.endswith(suffixes[0]):
        new_col = col[:-len(suffixes[0])]
        df_merge[new_col] = df_merge[col].where(df_merge[col].notnull(), df_merge[new_col + suffixes[1]])
        drop_cols.extend([col, new_col + suffixes[1]])

df_merge.drop(columns=drop_cols, inplace=True)

Note: For a more detailed explanation you can refer to the official documentation, while here you will find many examples.

Massifox
  • 4,369
  • 11
  • 31
  • Only catch is, I don’t know the column names (Name here).. there could be 500 columns which are in both fd) – Sumit Singh Sep 14 '19 at 04:18
  • Ok @SumitSingh, I updated my answer and added the code to answer also your last question. If that's what you needed mark my answer as best, thanks a lot :) – Massifox Sep 14 '19 at 08:16
  • seems like i dont have much reputation - Votes cast by those with less than 15 reputation are recorded, but do not change the publicly displayed post score. – Sumit Singh Sep 17 '19 at 10:14