
I created a function that takes the entire string from any column in my dataset and extracts the email address. If there is no email, it should fill the space with NaN:

import re
import numpy as np

def extract_email_ID(string):
    email = re.findall(r'<(.+?)>', string)
    if not email:
        email = list(filter(lambda y: '@' in y, string.split()))
    return email[0] if email else np.nan

I then applied the function to the "from" column of the dataset:

dfs['from'] = dfs['from'].apply(lambda x: extract_email_ID(x))

But I am getting the following error: TypeError: expected string or bytes-like object

gidmak
  • Your last line of code `dfs['from'].apply(...)` doesn't match what you described earlier *"... that takes the entire string from **any column**"*. Can you include an example *as text* with the matching expected output? – Timeless May 04 '23 at 11:49
  • There is a 'from' column that has other strings that I do not need, e.g. 'LinkedIn <notifications-noreply@linkedin.com>', so I just want to extract only the email address 'notifications-noreply@linkedin.com' from that field. Sorry if I didn't put it clearly. I am new to data analysis. – gidmak May 04 '23 at 12:06
  • No worries but please consider making a [minimal reproducible example](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) + the expected output and include them to your question. – Timeless May 04 '23 at 12:07
  • A better regex would include all the conditions, like `r'<[^<>@\s]+@[^<>@\s]+>'`. This will obviously still fail if there are email addresses which are not inside angle brackets. – tripleee May 04 '23 at 12:32
  • @Timus, Pandas is never mentioned explicitly and is not the object of the question itself. It does not matter how the function is called. If at all, I would add numpy as np.nan is used. – Svenito May 04 '23 at 13:35
  • @Svenito It matters, because Pandas offers optimized `.str`-methods that might be a better solution than an underperforming use of `.apply`. – Timus May 04 '23 at 14:38

2 Answers


It seems to me you have some non-string values in your example column dfs['from']. Perform a type check at the beginning of your function. If anything other than a string is detected, I assume you also want to return np.nan. So maybe you could insert this:

if not isinstance(string, str):
    return np.nan
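
For context, here is a minimal sketch of how the function from the question might look with that check in place. The comments and the list comprehension (in place of filter/lambda) are my additions, not part of the original code:

```python
import re

import numpy as np


def extract_email_ID(string):
    # Anything that is not a string (NaN, numbers, None) cannot be searched.
    if not isinstance(string, str):
        return np.nan
    # Prefer whatever is wrapped in angle brackets, e.g. "Name <a@b.c>".
    email = re.findall(r'<(.+?)>', string)
    if not email:
        # Otherwise fall back to whitespace-separated tokens containing "@".
        email = [token for token in string.split() if '@' in token]
    return email[0] if email else np.nan
```

With this version, a call such as dfs['from'].apply(extract_email_ID) should no longer raise the TypeError for rows that hold NaN or numbers.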
Svenito

You already have an answer that fixes your function, but I don't think your approach is the best. I'd rather:

  1. Stick to Pandas methods wherever possible. They are often faster and they already deal with the kind of problem you've just encountered (non-string values in the column). The method you could use here is .str.findall, the Pandas equivalent of re.findall.
  2. Try to find one regex-pattern that covers all the occurrences in your use case such that you don't have to branch the extraction. The pattern should only be as sophisticated as needed (you can find several approaches here).
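
To illustrate the first point, here is a small sketch with a made-up two-row Series (the values are assumptions, not your data). It shows re.findall failing on a missing value while .str.findall passes it through:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical sample: one string and one missing value.
s = pd.Series(["LinkedIn <notifications-noreply@linkedin.com>", np.nan])

# re.findall only accepts str or bytes, so the NaN (a float) raises TypeError.
try:
    re.findall(r"<(.+?)>", s.iloc[1])
    raised = False
except TypeError:
    raised = True

# The .str accessor skips the missing value and leaves NaN in its place.
found = s.str.findall(r"<(.+?)>")
```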

For example, you could try (I'm not saying the pattern is optimal, it really depends on your actual data):

pattern = r"""(?ix)       # Flags (must come first): i -> case insensitive, x -> verbose
    [a-z0-9]              # 1. Part - starts: with letter or digit
    [-.\w]*?              # 1. Part - middle: a mix of letters, digits, -, ., or _
    [a-z0-9]              # 1. Part - ends: with letter or digit
    @                     # Obvious
    [a-z0-9]+             # 2. Part - starts: with letters and/or digits
    (?:[-.][a-z0-9]+)*    # 2. Part - middle: - or . followed by letters and/or digits
    \.[a-z]{2,}           # 2. Part - ends: with a . and then at least 2 letters
"""

email = dfs["from"].str.findall(pattern)
dfs["from"] = email.where(email.str.len() > 0)

If you use that on the following sample dataframe

import numpy as np
import pandas as pd

dfs = pd.DataFrame({
    "from": [
        "Me <me@foO.net>; You <You@bar.org>; LinkedIn <notifications-noreply@linkedin.com>",
        "ME@foo.net you@baR.org | test123@first.second.last ab_c@a-d.z.org",
        "foobar",
        np.nan,
        123
    ]
})

0  Me <me@foO.net>; You <You@bar.org>; LinkedIn <notifications-noreply@linkedin.com>
1                  ME@foo.net you@baR.org | test123@first.second.last ab_c@a-d.z.org
2                                                                             foobar
3                                                                                NaN
4                                                                                123

you'll get the following result:

                                                                   from
0         [me@foO.net, You@bar.org, notifications-noreply@linkedin.com]
1  [ME@foo.net, you@baR.org, test123@first.second.last, ab_c@a-d.z.org]
2                                                                   NaN
3                                                                   NaN
4                                                                   NaN
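
If you only need the first address per row (as the email[0] in your function suggests), one option is to chain .str[0] onto the result. This is a sketch under that assumption; it uses a condensed copy of the pattern and a shortened, made-up sample frame:

```python
import numpy as np
import pandas as pd

# Condensed copy of the pattern above, with the inline flags at the very start.
pattern = r"""(?ix)
    [a-z0-9][-.\w]*?[a-z0-9]      # local part
    @
    [a-z0-9]+(?:[-.][a-z0-9]+)*   # domain labels
    \.[a-z]{2,}                   # top-level domain
"""

dfs = pd.DataFrame({
    "from": ["Me <me@foO.net>; You <You@bar.org>", "foobar", np.nan, 123]
})

# .str.findall gives a list per row; .str[0] picks the first element and
# yields NaN for empty lists as well as for missing or non-string values.
first = dfs["from"].str.findall(pattern).str[0]
```

Here the first row gives me@foO.net and the remaining rows are NaN, so no separate .where step is needed.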
Timus