1

I have this dataframe where I am trying to combine Email_x and Email_y to be a column email.

  1. If both are NaN then the result should be "" or np.nan.
  2. Email_y has a value and Email_x does not then use Email_y as the result(vice versa).
  3. If both have values then take the first one only.
                Email_x            Email_y Verification status_x Verification status_y
489   sample3@gmail.com                NaN                 valid                 valid
975    samle4@gmail.com                NaN            accept_all                   NaN
1192                NaN                NaN                   NaN                   NaN
1370  sample5@gmail.com                NaN               unknown                   NaN
2001                NaN                NaN                   NaN                   NaN
2565  sample2@gmail.com  sample2@gmail.com                 valid                   NaN
3900                NaN                NaN                   NaN                   NaN
3998                NaN                NaN                   NaN                   NaN
4192                NaN                NaN                   NaN                   NaN
4757                NaN   sample@gmail.com                   NaN                   NaN

I have tried using this but it fails when both have a value and results in both the values being combined.

df["email"] = df[["Email_x", "Email_y"]].fillna("").agg(" ".join, axis=1)

Result:

                Email_x            Email_y Verification status_x Verification status_y                                email
489   sample3@gmail.com                NaN                 valid                 valid                   sample3@gmail.com 
975    samle4@gmail.com                NaN            accept_all                   NaN                    samle4@gmail.com 
1192                NaN                NaN                   NaN                   NaN                                     
1370  sample5@gmail.com                NaN               unknown                   NaN                   sample5@gmail.com 
2001                NaN                NaN                   NaN                   NaN                                     
2565  sample2@gmail.com  sample2@gmail.com                 valid                   NaN  sample2@gmail.com sample2@gmail.com
3900                NaN                NaN                   NaN                   NaN                                     
3998                NaN                NaN                   NaN                   NaN                                     
4192                NaN                NaN                   NaN                   NaN                                     
4757                NaN   sample@gmail.com                   NaN                   NaN                     sample@gmail.com

Expected result:

                Email_x            Email_y Verification status_x Verification status_y                                email
489   sample3@gmail.com                NaN                 valid                 valid                   sample3@gmail.com 
975    samle4@gmail.com                NaN            accept_all                   NaN                    samle4@gmail.com 
1192                NaN                NaN                   NaN                   NaN                                     
1370  sample5@gmail.com                NaN               unknown                   NaN                   sample5@gmail.com 
2001                NaN                NaN                   NaN                   NaN                                     
2565  sample2@gmail.com  sample2@gmail.com                 valid                   NaN                    sample2@gmail.com
3900                NaN                NaN                   NaN                   NaN                                     
3998                NaN                NaN                   NaN                   NaN                                     
4192                NaN                NaN                   NaN                   NaN                                     
4757                NaN   sample@gmail.com                   NaN                   NaN                     sample@gmail.com

Sample dictionary:

import numpy as np

a = {
    "Email_x": {
        489: "sample3@gmail.com",
        975: "samle4@gmail.com",
        1192: np.nan,
        1370: "sample5@gmail.com",
        2001: np.nan,
        2565: "sample2@gmail.com",
        3900: np.nan,
        3998: np.nan,
        4192: np.nan,
        4757: np.nan,
    },
    "Email_y": {
        489: np.nan,
        975: np.nan,
        1192: np.nan,
        1370: np.nan,
        2001: np.nan,
        2565: "sample2@gmail.com",
        3900: np.nan,
        3998: np.nan,
        4192: np.nan,
        4757: "sample@gmail.com",
    },
    "Verification status_x": {
        489: "valid",
        975: "accept_all",
        1192: np.nan,
        1370: "unknown",
        2001: np.nan,
        2565: "valid",
        3900: np.nan,
        3998: np.nan,
        4192: np.nan,
        4757: np.nan,
    },
    "Verification status_y": {
        489: "valid",
        975: np.nan,
        1192: np.nan,
        1370: np.nan,
        2001: np.nan,
        2565: np.nan,
        3900: np.nan,
        3998: np.nan,
        4192: np.nan,
        4757: np.nan,
    },
}
yudhiesh
  • 6,383
  • 3
  • 16
  • 49

3 Answers3

2

Via np.select

condlist = [
    (df.Email_x.isna()) & (~df.Email_y.isna()), # 1st column NAN but 2nd is not 
    (df.Email_y.isna()) & (~df.Email_x.isna()), # 2nd column NAN but 1st is not
    (~df.Email_x.isna()) & (~df.Email_y.isna()) # both is not NAN
]

choicelist = [
    df.Email_y,
    df.Email_x,
    df.Email_x
]

df['Email'] = np.select(condlist,choicelist, default=‘') # default value ''
Nk03
  • 14,699
  • 2
  • 8
  • 22
  • This works but is a bit length as I need to do this for 4 more columns. If no one comes up with a better answer I will accept it :) – yudhiesh May 09 '21 at 07:38
1

Here is a great answer, which can be achieved with combine_first().

df['email'] = df['Email_x'].combine_first(df['Email_y'])
r-beginners
  • 31,170
  • 3
  • 14
  • 32
1

You can use .mask(), as follows:

df['email'] = df['Email_x'].mask(df['Email_x'].isna(), df['Email_y'])

It will retain the value of df['Email_x'] if the condition is false (i.e. not NaN) and replace with value of df['Email_y'] if the condition is true (i.e. df['Email_x'] is NaN).

SeaBean
  • 22,547
  • 3
  • 13
  • 25