Pandas: create new column if value from one column contains a substring

Question

I have dataframe

member_id,event_type,event_path,event_time,event_date,event_duration
20077,2016-11-20,"2016-11-20 09:17:07",url,e.mail.ru/message/14794236680000000730/,0
20077,2016-11-20,"2016-11-20 09:17:07",url,e.mail.ru/message/14794236680000000730/,2
20077,2016-11-20,"2016-11-20 09:17:09",url,avito.ru/profile/messenger/channel/u2i-558928587-101700461?utm_source=avito_mail&utm_medium=email&utm_campaign=messenger_single&utm_content=test,1
20077,2016-11-20,"2016-11-20 09:17:37",url,avito.ru/auto/messenger/channel/u2i-558928587-101700461?utm_source=avito_mail&utm_medium=email&utm_campaign=messenger_single&utm_content=test,135
20077,2016-11-20,"2016-11-20 09:19:53",url,e.mail.ru/message/14794236680000000730/,0
20077,2016-11-20,"2016-11-20 09:19:53",url,e.mail.ru/message/14794236680000000730/,37

and have another df2

domain  category    subcategory unique id   count_sec   Main category   Subcategory
avito.ru/auto   Автомобили Авто 1600    83112396    Auto  Avito
youtube.com Видеопортал Видеохостинг    1317    42710996    Video   Youtube
ok.ru   Развлечения     Социальные сети 694 13394605    Social network  OK
kinogo.club Развлечения     Кино    497 8438800 Video   Illegal
e.mail.ru   Почтовый сервис None    1124    8428984 Mail.ru Email
vk.com/audio    Видеопортал Видеохостинг    1020    7409440 Music   VK

Usually I use:

df['category'] = df.event_date.map(df2.set_index('domain')['Main category']

But it's compare data and if it's equal, it takes values and create it in new column. But how can I do the same, but use if substring in string?

Possible duplicate of [is it possible to do fuzzy match merge with python pandas?](http://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) — maxymoo, Dec 15 '16 at 22:40

score 0 · Answer 1 · answered Dec 15 '16 at 22:43

0

I really can't tell what exactly you are trying to do. But my suggestion would be something like this:

mapping = dict(df2.set_index('domain')['Main category'])

def map_to_substring(x):
    for key in mapping.keys():
        if x in key:
            return mapping[key]
    return ''

df['category'] = df.event_date.apply(lambda x: map_to_substring(x))

Test on a subsection of df as it may take a while depending on how much data you have.

answered Dec 15 '16 at 22:43

Alex

12,078
6
64
74

is any way to make it faster? I have 500 thousands str in event_date – Petr Petrov Dec 15 '16 at 23:21

score 0 · Answer 2 · answered Dec 16 '16 at 00:32

Without any heuristic for discovering fuzzy matches on which to join, you won't have a scalable solution, since you'll need to make O(N²) comparisons.

For your specific use case, I'd suggest extracting the part of the URL you do want to compare on. Maybe something like

from urlparse import urlparse

def netloc(s):
    return urlparse('http://' + s).netloc

df['netloc'] = df['event_date'].apply(netloc)
df2['netloc'] = df2['domain'].apply(netloc)

df.merge(df2, 'left', on='netloc')

Pandas: create new column if value from one column contains a substring

2 Answers2