I am trying to create a new column, new_email, that combines data from two columns: email_first_part and domain.

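def new_email_column(first_part, domain):
    # Treat the row as missing when the first part is the string "nan"
    # and the domain is empty.
    if first_part == "nan" and domain == "":
        return None  # stands in for the undefined name `missing`
    else:
        return first_part + "@" + domain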



df["new_email"] = df["new_email"].apply(df.loc["email_first_part"], df.loc["domain"])

The new_email column should use values from the other two columns, check the conditions given in the function, and be filled in accordingly. I get an error when I try to do this. How do I get this working? Thank you.

chexxmex

2 Answers

Try with:

df["new_email"] = df.apply(lambda x: new_email_column(x["email_first_part"], x["domain"]), axis=1)

Because you want to use data from other columns, you cannot call apply() on the "new_email" column alone (i.e. the pandas Series df["new_email"]), as in your original code df["new_email"].apply(...). You have to call apply() on the whole DataFrame df (or on a selection of the specific columns you need).

You need to pass axis=1 to apply() so that it performs a row-wise operation, passing each row (with all of its columns) to your function. Without axis=1, apply() processes the columns one by one, and within each column you can only access values by row index, not by column label.
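To make the difference concrete, here is a small sketch that records which labels the function can see in each mode. The sample DataFrame and the variable names are made up for illustration.

```python
import pandas as pd

# Illustrative data, not taken from the question.
df = pd.DataFrame({
    "email_first_part": ["abc", "xyz"],
    "domain": ["gmail.com", "outlook.com"],
})

# Default (axis=0): the function receives one column at a time,
# so the labels available inside it are the row labels.
seen_by_axis0 = df.apply(lambda col: ", ".join(map(str, col.index)))

# axis=1: the function receives one row at a time,
# so the labels available inside it are the column names.
seen_by_axis1 = df.apply(lambda row: ", ".join(row.index), axis=1)

# With axis=1 both columns can be combined per row.
emails = df.apply(lambda row: row["email_first_part"] + "@" + row["domain"], axis=1)

print(seen_by_axis0)
print(seen_by_axis1)
print(emails)
```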

Using a lambda function lets you call your custom function new_email_column() without modifying it. Otherwise, you would need to amend the function so that it takes the passed-in row Series as a parameter.
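For comparison, a version of the function amended to take the row directly might look like the sketch below. The name new_email_from_row is hypothetical, and the missing-value check assumes missing first parts are real NaN/None values rather than the string "nan", which differs slightly from the condition in the question.

```python
import pandas as pd

def new_email_from_row(row):
    # Hypothetical variant: receives the whole row as a Series.
    first_part = row["email_first_part"]
    domain = row["domain"]
    # Assumption: a row is "missing" if the first part is NaN/None
    # or the domain is an empty string.
    if pd.isna(first_part) or domain == "":
        return None
    return first_part + "@" + domain

# Illustrative data.
df = pd.DataFrame({
    "email_first_part": ["abc", "xyz", None],
    "domain": ["gmail.com", "outlook.com", ""],
})

# No lambda needed: apply() passes each row straight to the function.
df["new_email"] = df.apply(new_email_from_row, axis=1)
print(df)
```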

If you want to better understand this convention of using apply(..., axis=1) and also explore another calling convention with better system performance (execution time), you can refer to this post for further information.

SeaBean

An efficient way to do this would be to replace the empty strings with NaN and then just concatenate the columns. It is always good to keep missing values as NaNs rather than a string.

import numpy as np

df = df.replace({'': np.nan})
df['new_email'] = df['email_first_part'] + '@' + df['domain']

Input

  email_first_part       domain
0              abc    gmail.com
1              xyz  outlook.com
2              NaN             

Output

  email_first_part       domain        new_email
0              abc    gmail.com    abc@gmail.com
1              xyz  outlook.com  xyz@outlook.com
2              NaN          NaN              NaN
Vishnudev Krishnadas
  • So, I am using some functions, and when I apply them to columns of type str, Python throws an error because NaN values are floats. Is there any way to fix this? – chexxmex Feb 18 '21 at 18:42
  • Vectorize them and pandas will handle them better than pure Python. By vectorized I mean: try not to use apply whenever possible. – Vishnudev Krishnadas Feb 19 '21 at 05:30
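
As one possible illustration of the vectorized approach mentioned in the comment, the sketch below uses the pandas .str accessor, whose methods pass NaN through instead of raising. The sample data and the choice of upper() and len() are only examples; the functions the asker actually uses are not shown in the thread.

```python
import numpy as np
import pandas as pd

# Illustrative data with a missing value in a string column.
df = pd.DataFrame({"email_first_part": ["abc", "xyz", np.nan]})

# .str methods operate element-wise and leave NaN as NaN,
# so no per-element type check is needed.
upper = df["email_first_part"].str.upper()
lengths = df["email_first_part"].str.len()

print(upper)
print(lengths)
```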