
I have clickstream data and am using the URL column to detect special events. For example, if the URL contains the keyword "Dealer", a new column "Is Dealer" is created that holds a Boolean value.

**df example:**

Dictionary: I have a dictionary where each key is a "Domain" and each value is a list of keywords that must be checked in the URL.

brand_dict = {'volkswagen': ['haendlersuche'], 'mercedes-benz': ['dealer-locator'], 'skoda-auto': ['dealers']}

I first need to check two conditions against other columns: if the Domains column equals, say, "BMW" and the URL contains any keyword from that domain's list in the dictionary, then the new column gets a Boolean value.
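For illustration, the two conditions could be expressed as a single helper. This is only a minimal sketch: `url_has_keyword` is a hypothetical name, the example URLs are made up, and matching is assumed to be a plain case-sensitive substring test.

```python
brand_dict = {
    "volkswagen": ["haendlersuche"],
    "mercedes-benz": ["dealer-locator"],
    "skoda-auto": ["dealers"],
}


def url_has_keyword(domain, url, mapping):
    """Return True if `domain` is a key of `mapping` and `url` contains any of its keywords."""
    # .get() returns an empty list for unknown domains, so any() yields False
    return any(keyword in url for keyword in mapping.get(domain, []))


# Made-up URLs, for illustration only
print(url_has_keyword("skoda-auto", "https://www.skoda-auto.com/dealers/find", brand_dict))
print(url_has_keyword("bmw", "https://www.bmw.com/dealers", brand_dict))
```

The second call returns False because "bmw" is not a key of this particular dictionary, even though the URL contains a keyword that belongs to another domain.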

The problem is that I have to create 3 columns, and I have 3 dictionaries. Is there a good way to do this?

So far I am doing this:

def conv_attribution(domain, url):
    """Return [config match, dealer match, matched brand keyword or "Nan"]."""
    list_output = []

    if domain in dict_config:
        # True if any config keyword for this domain occurs in the URL
        bolcheck1 = any(keyword in url for keyword in dict_config[domain])

        # .get() avoids a KeyError when the domain is missing from a dictionary
        bolcheck2 = any(keyword in url for keyword in dict_dealer.get(domain, []))

        # Keep the keyword that actually matched, not the last one iterated
        matched = next(
            (keyword for keyword in dict_brand_keywords.get(domain, []) if keyword in url),
            None,
        )

        list_output.append(bolcheck1)
        list_output.append(bolcheck2)
        list_output.append(matched if matched is not None else "Nan")

    return list_output

Please help...

Desired Output

The desired output would look like this, except that in Model Name I want to add the model name extracted from the URL.

[Image: desired output table]

s_khan92

1 Answer


Here is a minimal example:

import pandas as pd
domains = ['bmw','smart','smart','fiat','bmw']
urls = ['https://bmw.com/hello','https://smart.com/world','https://smart.com/hello','https://fiat.com/hello','https://bmw.com/hello']
df = pd.DataFrame({'domain':domains,'urls':urls})
# your config dict
brand_dict = {'bmw': ['hello'], 'smart': ['world'],'fiat':['hello']} 

sample df

    domain  urls
0   bmw     https://bmw.com/hello
1   smart   https://smart.com/world
2   smart   https://smart.com/hello
3   fiat    https://fiat.com/hello
4   bmw     https://bmw.com/hello

create new columns

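# .get() returns an empty list for domains missing from the dictionary, so those rows yield False
df['col_1'] = df.apply(lambda x: any(substring in x.urls for substring in brand_dict.get(x.domain, [])), axis=1)
# col_2 reuses brand_dict for brevity; substitute your second dictionary here
df['col_2'] = df.apply(lambda x: any(substring in x.urls for substring in brand_dict.get(x.domain, [])), axis=1)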
df

new df

   domain   urls                    col_1   col_2
0   bmw     https://bmw.com/hello   True    True
1   smart   https://smart.com/world True    True
2   smart   https://smart.com/hello False   False
3   fiat    https://fiat.com/hello  True    True
4   bmw     https://bmw.com/hello   True    True
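
The same idea could be extended to three dictionaries and a "Model Name" column. What follows is a rough sketch under stated assumptions: the data, the contents of the three dictionaries, the helper `first_match`, and the column name "Is Config" are all made up for illustration; only "Is Dealer" and "Model Name" come from the question.

```python
import pandas as pd

# Made-up sample data
df = pd.DataFrame({
    "domain": ["bmw", "smart", "audi"],
    "urls": [
        "https://bmw.com/dealer/x5",
        "https://smart.com/configurator",
        "https://audi.com/dealer",
    ],
})

# Hypothetical dictionaries, one per output column
dict_config = {"bmw": ["configurator"], "smart": ["configurator"]}
dict_dealer = {"bmw": ["dealer"], "smart": ["haendler"]}
dict_brand_keywords = {"bmw": ["x5", "x3"], "smart": ["fortwo"]}


def first_match(domain, url, mapping):
    """Return the first keyword listed for `domain` that occurs in `url`, else None."""
    return next((kw for kw in mapping.get(domain, []) if kw in url), None)


# Boolean columns: one loop instead of one hand-written block per dictionary
bool_columns = {"Is Config": dict_config, "Is Dealer": dict_dealer}
for column, mapping in bool_columns.items():
    df[column] = [
        first_match(d, u, mapping) is not None
        for d, u in zip(df["domain"], df["urls"])
    ]

# Model name column: keep the matched keyword itself rather than a Boolean
df["Model Name"] = [
    first_match(d, u, dict_brand_keywords)
    for d, u in zip(df["domain"], df["urls"])
]

print(df)
```

Rows whose domain is absent from a dictionary (here "audi") get False in the Boolean columns and a missing value in "Model Name". Iterating with `zip` over the two columns avoids the per-row overhead of `df.apply(..., axis=1)`, though either approach should work for moderately sized data.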
vumaasha