You already have an answer that fixes your function, but I don't think your approach is the best. I'd rather:
- Stick to pandas methods wherever possible. They are faster and they deal with the kind of problems you've just encountered. The method you could use here is .str.findall, the pandas equivalent of re.findall.
- Try to find one regex pattern that covers all the occurrences in your use case, so that you don't have to branch the extraction. The pattern should only be as sophisticated as needed (you can find several approaches here).
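To illustrate the first point, here is a minimal sketch of how plain re.findall and .str.findall differ on values that aren't strings. The toy series and the digit pattern are made up for the demonstration, and I'm assuming that missing or non-string values are the kind of problem you ran into:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical toy data: one string, one missing value, one non-string value.
s = pd.Series(["a1 b22", np.nan, 123])
digits = r"\d+"  # illustrative pattern only, not the e-mail pattern

# Plain re.findall needs a string, so applying it row by row fails on NaN.
try:
    s.apply(lambda value: re.findall(digits, value))
    apply_failed = False
except TypeError:
    apply_failed = True

# The pandas accessor returns NaN for anything that is not a string.
found = s.str.findall(digits)

print(apply_failed)
print(found.tolist())
```

With .apply you would have to guard against those values yourself; the .str accessor does that for you.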
For example, you could try (I'm not saying the pattern is optimal, it really depends on your actual data):
pattern = r"""(?ix) # Flags (must come first in the pattern): i -> case insensitive, x -> verbose
[a-z0-9] # 1. Part - starts: with letter or digit
[-.\w]*? # 1. Part - middle: a mix of letters, digits, -, ., or _
[a-z0-9] # 1. Part - ends: with letter or digit
@ # Obvious
[a-z0-9]+ # 2. Part - starts: with letters and/or digits
(?:[-.][a-z0-9]+)* # 2. Part - middle: - or . followed by letters and/or digits
\.[a-z]{2,} # 2. Part - ends: with a . and then at least 2 letters
"""
email = dfs["from"].str.findall(pattern)
dfs["from"] = email.where(email.str.len() > 0)
If you use that on the following sample dataframe
import numpy as np
import pandas as pd

dfs = pd.DataFrame({
"from": [
"Me <me@foO.net>; You <You@bar.org>; LinkedIn <notifications-noreply@linkedin.com>",
"ME@foo.net you@baR.org | test123@first.second.last ab_c@a-d.z.org",
"foobar",
np.nan,
123
]
})
0 Me <me@foO.net>; You <You@bar.org>; LinkedIn <notifications-noreply@linkedin.com>
1 ME@foo.net you@baR.org | test123@first.second.last ab_c@a-d.z.org
2 foobar
3 NaN
4 123
you'll get the following result:
from
0 [me@foO.net, You@bar.org, notifications-noreply@linkedin.com]
1 [ME@foo.net, you@baR.org, test123@first.second.last, ab_c@a-d.z.org]
2 NaN
3 NaN
4 NaN
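If one address per row is easier to work with afterwards, the list column can be expanded with explode. This is only a sketch: it uses a compact, non-verbose form of the same pattern (with the case flag passed via flags instead of inline), a shortened sample frame, and it assumes that rows without a match can simply be dropped:

```python
import re

import numpy as np
import pandas as pd

# Compact form of the pattern above; case is ignored via the flags argument.
pattern = r"[a-z0-9][-.\w]*?[a-z0-9]@[a-z0-9]+(?:[-.][a-z0-9]+)*\.[a-z]{2,}"

# Shortened version of the sample data, for illustration only.
dfs = pd.DataFrame({
    "from": ["Me <me@foO.net>; You <You@bar.org>", "foobar", np.nan, 123]
})

found = dfs["from"].str.findall(pattern, flags=re.IGNORECASE)
dfs["from"] = found.where(found.str.len() > 0)

# Drop rows without any address, then put each address on its own row.
addresses = dfs["from"].dropna().explode()
print(addresses.tolist())
```

The exploded series keeps the original index, so each address can still be traced back to the row it came from.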