Get multiple column value based on partial matching with another column value for pandas dataframe

Question

I have a following dataframe:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['ole.norlin@mail.dk', 'ole.gregersen@hk.dk', 'prima-rent@youseepost.dk', 'jb@rentind.dk', 'frisoren01@gmail.com','stigterndrup@gmail.com', 'psn@psn.dk', 'samuel@malerfirmaet-lykkebo.dk', 'jan@mundt-reklame.dk',  'nordsjalland@phonixtag.dk', 'jp@rudersdalmaleren.dk', 'vvs@hestetangen.dk', 'steenkragelund@mail.tele.dk', 'kasserer@keilstruplund.dk']},                                        
                                 {'main_url': 'http://kirsebaergaarden.com', 'emails': ['info@kirsebaergaarden.com','ghost1054@yahoo.com']},
                                 {'main_url': 'http://koglernes.dk', 'emails': ['info@koglernes.dk']},
                                  {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                               ])

However, I want to keep only those values for property named "emails" whose every element's value after '@' is same as the corresponding value of the 'main_url' property but after "http://" resulting the following data frame:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['kasserer@keilstruplund.dk']},                                        
                                 {'main_url': 'http://kirsebaergaarden.com', 'emails': ['info@kirsebaergaarden.com']},
                                 {'main_url': 'http://koglernes.dk', 'emails': ['info@koglernes.dk']},
                                  {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                               ])

Any hints or approach is appreciable considering the fact that I have millions row to implement the transformation

Onur Guven · Accepted Answer · 2022-04-12T10:59:21.663

Give this a try I think it should be able to handle a few millions of rows.

def list_check(emails_list, email_match):
    match_indexes = [i for i, s in enumerate(emails_list) if email_match in s]
    return [emails_list[index] for index in match_indexes]

# Parse main_url to get domain column
df['domain'] = list(map(lambda x: x.split('//')[1], df['main_url']))

# Apply list_check to your dataframe using emails and domain columns
df['emails'] = list(map(lambda x, y: list_check(x, y), df['emails'], df['domain']))

# Drop domain column
df.drop(columns=['domain'], inplace=True)

list_check function checks whether your match string is in the emails list and gets indexes of matches, then gets values from the emails list using matched indexes and returns those values in a list.

Output:

source for getting matched indexes

Sorry, I am unable to make it work as I do not have any separate column named "domain". also the domain name should extracted from the main_url values. Thanks for the try :) — Shihab Ullah, Apr 12 '22 at 10:58

score 1 · Answer 2 · answered Apr 12 '22 at 11:12

You can try to explode the mails column then compare with main_url column

df_ = URL_WITH_EMAILS_DF.explode('emails').reset_index()
m = (df_['main_url'].str.split('//').str[1] == df_['emails'].str.split('@').str[1])
df_ = df_[m].groupby('main_url').agg({'emails':lambda x: list(x)})

print(df_)
                                                  emails
main_url
http://keilstruplund.dk      [kasserer@keilstruplund.dk]
http://kirsebaergaarden.com  [info@kirsebaergaarden.com]
http://koglernes.dk                  [info@koglernes.dk]

At last, map the result to original dataframe

URL_WITH_EMAILS_DF['matched'] = URL_WITH_EMAILS_DF['main_url'].map(df_['emails']).fillna("").apply(list)

print(URL_WITH_EMAILS_DF[['main_url', 'matched']])

                         main_url                      matched
0         http://keilstruplund.dk  [kasserer@keilstruplund.dk]
1     http://kirsebaergaarden.com  [info@kirsebaergaarden.com]
2             http://koglernes.dk          [info@koglernes.dk]
3  http://kongehojensbornehave.dk                           []

When its millions of row, how is your answer better performance wise? — Shihab Ullah, Apr 12 '22 at 11:42

Get multiple column value based on partial matching with another column value for pandas dataframe

2 Answers2