I have a DataFrame
with some user input (it's supposed to just be a plain email address), along with some other values, like this:
import pandas as pd
from pandas import Series, DataFrame
df = pd.DataFrame({'input': ['Captain Jean-Luc Picard <picard@starfleet.com>','deanna.troi@starfleet.com','data@starfleet.com','William Riker <riker@starfleet.com>'],'val_1':[1.5,3.6,2.4,2.9],'val_2':[7.3,-2.5,3.4,1.5]})
Due to a bug, the input sometimes has the user's name as well as brackets around the email address; this needs to be fixed before continuing with the analysis.
To move forward, I want to create a new column that has cleaned versions of the emails: if the email contains names/brackets
then remove those, else just give the already correct email.
There are numerous examples of cleaning string data with Python/pandas
, but I've yet to find successfully implement any of these suggestions. Here are a few examples of what I've tried:
# as noted in pandas docs, turns all non-matching strings into NaN
df['cleaned'] = df['input'].str.extract('<(.*)>')
# AttributeError: type object 'str' has no attribute 'contains'
df['cleaned'] = df['input'].apply(lambda x: str.extract('<(.*)>') if str.contains('<(.*)>') else x)
# AttributeError: 'DataFrame' object has no attribute 'str'
df['cleaned'] = df[df['input'].str.contains('<(.*)>')].str.extract('<(.*)>')
Thanks!