1

I have two dataframes and I would like to join these dataframes based on column 'Name' of df1 and multiple columns of df2.

df1
Name            id
ZYMAXID 9416X   6390
ZYPRED          6391


df2
label    pref_label  alt_label
ZYPRED   None        None
None     ZYMAXID 9416X  None

The final output should be:

Name            id     label    pref_label     alt_label
ZYMAXID 9416X   6390   None     ZYMAXID 9416X  None
ZYPRED          6391   ZYPRED   None           None

I tried below join but it is giving me error.

df = df1.merge(df2, left_on=df1["Name"].str.lower(), right_on=['df2["label"].str.lower()','df2["pref_label"].str.lower()''df2["alt_label"].str.lower()', indicator = True)

Any help is highly appreciated.

gtomer
  • 5,643
  • 1
  • 10
  • 21
rshar
  • 1,381
  • 10
  • 28

3 Answers3

1

You could find the first non-None value in the labels in df2 using the method from this answer:

df2['Name'] = df2.fillna(method='bfill', axis=1).iloc[:, 0]

You can now merge on that value:

res = df1.merge(df2, on='Name')

Output:

            Name    id   label     pref_label  alt_label
0  ZYMAXID 9416X  6390     NaN  ZYMAXID 9416X        NaN
1         ZYPRED  6391  ZYPRED            NaN        NaN
Nick
  • 138,499
  • 22
  • 57
  • 95
0

To match you MUST have a common column (at least one) in both dataframe.

If you just want to concat (stick) the two dataframes side by side, you can:

df = pd.concat([df1, df2], axis=1)

If you want to merge df2 on either of the three labels you must do several merges:

df = df1.merge(df2['label'], left_on=df1["Name"].str.lower(), right_on='df2["label"].str.lower()', how='left')
df = df.merge(df2['label'], left_on=df1["Name"].str.lower(), right_on='df2["label"].str.lower()', how='left')
df = df.merge(df2['alt_label'], left_on=df1["Name"].str.lower(), right_on='df2["alt_label"].str.lower()', how='left')
gtomer
  • 5,643
  • 1
  • 10
  • 21
0

Another possible solution:

df1.merge(df2.assign(Name = df2.replace(np.nan, '').sum(axis=1)))

In case it is false that only one of the labels is not None, a possible solution is:

(df1.merge(
    df2.assign(Name = df2.replace(np.nan, '')
               .apply(lambda x: x.drop_duplicates().sum(), axis=1))))

Output:

            Name    id   label     pref_label alt_label
0  ZYMAXID 9416X  6390    None  ZYMAXID 9416X      None
1         ZYPRED  6391  ZYPRED           None      None
PaulS
  • 21,159
  • 2
  • 9
  • 26