
I have a data frame like this:

**Domain**         **URL**  
Amazon         amazon.com/xyz/butter
Amazon         amazon.com/xyz/orange
Facebook       facebook.com/male
Google         google.com/airport
Google         goolge.com/car

This is just made-up data. I have clickstream data in which I want to use the "Domain" and "URL" columns. I have a long list of keywords saved in a dictionary, keyed by domain, and I need to search for them in each URL and extract the matching keyword into a new column.

I have a dictionary like this:

dict_config = {'Facebook': ['boy', 'girl', 'man'], 'Google': ['airport', 'car', 'konfigurator'], 'Amazon': ['apple', 'orange', 'butter']}

I want to obtain output like this:

  **Domain**         **URL**                     **Keyword**
    Amazon         amazon.com/xyz/butter         butter
    Amazon         amazon.com/xyz/orange         orange
    Facebook       facebook.com/male             male
    Google         google.com/airport            airport
    Google         goolge.com/car                car

I would like to do this in a single line of code. So far I have tried:

df['Keyword'] = df.apply(lambda x: any(substring in x.URL for substring in dict_config[x.Domain]), axis=1)

This only gives me a Boolean value, because any() returns True or False, but I want to return the matching keyword itself. Any help?

s_khan92

1 Answer

The idea is to add a filter with if at the end of the list comprehension, then wrap the result in next with iter so that the first match is returned, or a default value if there is no match:

f = lambda x: next(iter([sub for sub in dict_config[x.Domain] if sub in x.URL]), 'no match')
df['Keyword'] = df.apply(f, axis=1)
print (df)
     Domain                    URL   Keyword
0    Amazon  amazon.com/xyz/butter    butter
1    Amazon  amazon.com/xyz/orange    orange
2  Facebook      facebook.com/male  no match
3    Google     google.com/airport   airport
4    Google         goolge.com/car       car
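For anyone who wants to run this end to end, here is a minimal self-contained sketch using the sample data from the question. It is a variation rather than the exact code above: it passes a generator expression straight to next, so no intermediate list is built and the search stops at the first hit. The helper name first_keyword and its default parameter are made up for illustration.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "Domain": ["Amazon", "Amazon", "Facebook", "Google", "Google"],
    "URL": [
        "amazon.com/xyz/butter",
        "amazon.com/xyz/orange",
        "facebook.com/male",
        "google.com/airport",
        "goolge.com/car",
    ],
})

dict_config = {
    "Facebook": ["boy", "girl", "man"],
    "Google": ["airport", "car", "konfigurator"],
    "Amazon": ["apple", "orange", "butter"],
}


def first_keyword(row, default="no match"):
    """Return the first keyword for row.Domain found in row.URL (hypothetical helper)."""
    # The generator is consumed lazily: next() stops at the first match,
    # and falls back to `default` when nothing matches.
    return next((kw for kw in dict_config[row.Domain] if kw in row.URL), default)


df["Keyword"] = df.apply(first_keyword, axis=1)
print(df)
```

Like the lambda above, this raises KeyError if a Domain is not a key in dict_config.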

If a value in the Domain column might also be missing from the dictionary, change the solution to use .get, which looks up the key and falls back to a default value:

print (df)
     Domain                    URL
0    Amazon  amazon.com/xyz/butter
1    Amazon  amazon.com/xyz/orange
2  Facebook      facebook.com/male
3    Google     google.com/airport
4   Google1         goolge.com/car <- changed last value to Google1

dict_config = {'Facebook': ['boy', 'girl', 'man'], 
               'Google': ['airport', 'car', 'konfigurator'],
               'Amazon': ['apple', 'orange', 'butter']}

f = lambda x: next(iter([sub for sub in dict_config.get(x.Domain, []) 
                         if sub in x.URL]), 'no match')
df['Keyword'] = df.apply(f, axis=1)
print (df)
     Domain                    URL   Keyword
0    Amazon  amazon.com/xyz/butter    butter
1    Amazon  amazon.com/xyz/orange    orange
2  Facebook      facebook.com/male  no match
3    Google     google.com/airport   airport
4   Google1         goolge.com/car  no match
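The same lookup can be tried without pandas, which may make the roles of .get and next easier to see. This is only a sketch: the function name find_keyword is hypothetical, and it uses an empty list as the .get default instead of an empty string (both iterate as "no keywords").

```python
dict_config = {
    "Facebook": ["boy", "girl", "man"],
    "Google": ["airport", "car", "konfigurator"],
    "Amazon": ["apple", "orange", "butter"],
}


def find_keyword(domain, url, default="no match"):
    """Return the first configured keyword for `domain` that occurs in `url`."""
    # .get avoids a KeyError for unknown domains by returning an empty list
    keywords = dict_config.get(domain, [])
    # next() takes the first match from the generator, or `default` if it is empty
    return next((kw for kw in keywords if kw in url), default)


print(find_keyword("Google", "google.com/airport"))   # airport
print(find_keyword("Google1", "goolge.com/car"))      # no match (unknown domain)
print(find_keyword("Facebook", "facebook.com/male"))  # no match (no keyword in URL)
```

With a DataFrame, the same function could be applied row by row, for example df.apply(lambda x: find_keyword(x.Domain, x.URL), axis=1).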
jezrael
  • You are always a lifesaver. Amazing, @jezrael. Can you recommend good sources where I can learn more about these next, iter, and lambda operations? – s_khan92 Oct 31 '19 at 09:44
  • @MuhammadSalmanShahid - this is more of a pure Python approach, so you can check [this](https://stackoverflow.com/questions/16814984/python-list-iterator-behavior-and-nextiterator). – jezrael Oct 31 '19 at 09:50