0

I have a following dataframe:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['ole.norlin@mail.dk', 'ole.gregersen@hk.dk', 'prima-rent@youseepost.dk', 'jb@rentind.dk', 'frisoren01@gmail.com','stigterndrup@gmail.com', 'psn@psn.dk', 'samuel@malerfirmaet-lykkebo.dk', 'jan@mundt-reklame.dk',  'nordsjalland@phonixtag.dk', 'jp@rudersdalmaleren.dk', 'vvs@hestetangen.dk', 'steenkragelund@mail.tele.dk', 'kasserer@keilstruplund.dk']},                                        
                                 {'main_url': 'http://kirsebaergaarden.com', 'emails': ['info@kirsebaergaarden.com','ghost1054@yahoo.com']},
                                 {'main_url': 'http://koglernes.dk', 'emails': ['info@koglernes.dk']},
                                  {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                               ])

However, I want to keep only those values for property named "emails" whose every element's value after '@' is same as the corresponding value of the 'main_url' property but after "http://" resulting the following data frame:

URL_WITH_EMAILS_DF = pd.DataFrame(data=[{'main_url': 'http://keilstruplund.dk', 'emails': ['kasserer@keilstruplund.dk']},                                        
                                 {'main_url': 'http://kirsebaergaarden.com', 'emails': ['info@kirsebaergaarden.com']},
                                 {'main_url': 'http://koglernes.dk', 'emails': ['info@koglernes.dk']},
                                  {'main_url': 'http://kongehojensbornehave.dk', 'emails': []}
                               ])

enter image description here

Any hints or approach is appreciable considering the fact that I have millions row to implement the transformation

2 Answers2

1

Give this a try I think it should be able to handle a few millions of rows.

def list_check(emails_list, email_match):
    match_indexes = [i for i, s in enumerate(emails_list) if email_match in s]
    return [emails_list[index] for index in match_indexes]

# Parse main_url to get domain column
df['domain'] = list(map(lambda x: x.split('//')[1], df['main_url']))

# Apply list_check to your dataframe using emails and domain columns
df['emails'] = list(map(lambda x, y: list_check(x, y), df['emails'], df['domain']))

# Drop domain column
df.drop(columns=['domain'], inplace=True)

list_check function checks whether your match string is in the emails list and gets indexes of matches, then gets values from the emails list using matched indexes and returns those values in a list.

Output:

output df

source for getting matched indexes

Onur Guven
  • 620
  • 4
  • 15
1

You can try to explode the mails column then compare with main_url column

df_ = URL_WITH_EMAILS_DF.explode('emails').reset_index()
m = (df_['main_url'].str.split('//').str[1] == df_['emails'].str.split('@').str[1])
df_ = df_[m].groupby('main_url').agg({'emails':lambda x: list(x)})
print(df_)
                                                  emails
main_url
http://keilstruplund.dk      [kasserer@keilstruplund.dk]
http://kirsebaergaarden.com  [info@kirsebaergaarden.com]
http://koglernes.dk                  [info@koglernes.dk]

At last, map the result to original dataframe

URL_WITH_EMAILS_DF['matched'] = URL_WITH_EMAILS_DF['main_url'].map(df_['emails']).fillna("").apply(list)
print(URL_WITH_EMAILS_DF[['main_url', 'matched']])

                         main_url                      matched
0         http://keilstruplund.dk  [kasserer@keilstruplund.dk]
1     http://kirsebaergaarden.com  [info@kirsebaergaarden.com]
2             http://koglernes.dk          [info@koglernes.dk]
3  http://kongehojensbornehave.dk                           []
Ynjxsjmh
  • 28,441
  • 6
  • 34
  • 52