Vectorization of pandas dataframe to split domains, with multiple if-else clauses

Question

Please help make the following pandas dataframe code vectorized/ faster, it's very slow.

I have the below code, which works exactly as I want. It takes domains with lots of subdomains & normalizes them to just the hostname + TLD.

I can't find any vectorization examples using if-else statements.

import pandas as pd
import time
#import file into dataframe

start = time.time()
path = "Desktop/dom1.csv"

df = pd.read_csv(path, delimiter=',', header='infer', encoding = "ISO-8859-1")

#strip out all ---- values
df2 = df[((df['domain'] != '----'))]

#extract only 2 columns from dataframe
df3 = df2[['domain', 'web.optimisedsize']]

#define tld and cdn lookup lists
tld = ['co.uk', 'com', 'org', 'gov.uk', 'co', 'net', 'news', 'it', 'in' 'es', 'tw', 'pe', 'io', 'ca', 'cat', 'com.au',
  'com.ar', 'com.mt', 'com.co', 'ws', 'to', 'es', 'de', 'us', 'br', 'im', 'gr', 'cc', 'cn', 'org.uk', 'me', 'ovh', 'be',
  'tv', 'tech', '..', 'life', 'com.mx', 'pl', 'uk', 'ru', 'cz', 'st', 'info', 'mobi', 'today', 'eu', 'fi', 'jp', 'life',
  '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'earth', 'ninja', 'ie', 'im', 'ai', 'at', 'ch', 'ly', 'market', 'click',
  'fr', 'nl', 'se']
cdns = ['akamai', 'maxcdn', 'cloudflare']

#iterate through each row of the datafrme and split each domain at the dot
for row in df2.itertuples():
  index = df3.domain.str.split('.').tolist()
  cleandomain = []
  #iterate through each of the split domains
  for x in index:
    #if it isn't a string, then print the value directly in the cleandomain list
    if not isinstance(x, str):
        cleandomain.append(str(x))
    #if it's a string that encapsulates numbers, then it's an IP
    elif str(x)[-1].isnumeric():
        try:
            cleandomain.append(str(x[0])+'.'+str(x[1])+'.*.*')
        except IndexError:
            cleandomain.append(str(x))
    #if its in the CDN list, take a subdomain as well
    elif len(x) > 3 and str(x[len(x)-2]).rstrip() in cdns:
        try:
            cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    elif len(x) > 3 and str(x[len(x)-3]).rstrip() in cdns:
        try:
            cleandomain.append(str(x[len(x)-4])+'.'+str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    #if its in the TLD list, do this
    elif len(x) > 2 and str(x[len(x)-2]).rstrip()+'.'+ str(x[len(x)-1]).rstrip() in tld:
        try:
            cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    elif len(x) > 2 and str(x[len(x)-1]) in tld:
        try:
            cleandomain.append(str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    #if its not in the TLD list, do this
    else:
      cleandomain.append(str(x))

#add the column to the dataframe
df3['newdomain2']=cleandomain
se = pd.Series(cleandomain)
df3['newdomain2'] = se.values

#select only the new domain column & usage
df4 = df3[['newdomain2', 'web.optimisedsize']]

#group by
df5 = df4.groupby(['newdomain2'])[['web.optimisedsize']].sum()

#sort
df6 = df5.sort_values(['web.optimisedsize'], ascending=["true"])
end = time.time()
print(df6)
print(end-start)

My input is this DF:

In [4]: df
Out[4]:
                     Domain      Use
0        graph.facebook.com     4242
1            news.bbc.co.uk    23423
2  news.more.news.bbc.co.uk   234432
3       profile.username.co   235523
4           offers.o2.co.uk   235523
5     subdomain.pyspark.org     2325
6       uds.data.domain.net    23523
7         domain.akamai.net    23532
8           333.333.333.333  3432324

During, the index splits it to this:

[['graph', 'facebook', 'com'], ['news', 'bbc' .....

I then append the new domain to the original dataframe as a new column. This then gets grouped by + summed to create the final dataframe.

In [10]: df
Out[10]:
                     Domain      Use         newdomain
0        graph.facebook.com     4242       facebook.com
1            news.bbc.co.uk    23423          bbc.co.uk
2  news.more.news.bbc.co.uk   234432          bbc.co.uk
3       profile.username.co   235523        username.co

Please add a sample dataframe to your question. [Read](https://stackoverflow.com/help/mcve) — Vishnudev Krishnadas, Oct 21 '18 at 12:31
I'll try to extract a test dataset - the whole thing is 100M rows & very big|! — kikee1222, Oct 21 '18 at 13:09
just pass the parameter ", nrows=1000" when you "read_csv" files so can generate a sample data. — n1tk, Oct 21 '18 at 13:30
Never do `x[len(x)-2])`, in Python we have [negative indexing](https://stackoverflow.com/questions/11367902/negative-list-index) so you can simply do `x[-2]`, it's clearer and faster. — smci, Nov 10 '19 at 18:16
All five `try-except` clauses (and also the case `if not isinstance(x, str)`) have the exact same except clause: `except IndexError: cleandomain.append(str(x))`, so you can just move the `try`-statement to outside the `if-else` ladder, and the `except` statement to below the ladder. — smci, Nov 10 '19 at 18:20

score 2 · Accepted Answer · answered Oct 21 '18 at 21:15

One of the problems is that in every iteration you execute you have index = df3.domain.str.split('.').tolist(). When I put this line outside of the loop the calculation is 2 times faster. 587ms VS 1.1s.

I also think that your code is wrong. You do not use the row variable and use index instead. And when you iterate index one element is always a list. So if not isinstance(x, str) is always True. (You can see it in line_debugger output below)

String operations are generally not vectorizable. Even the .str notation is in reality a python loop.

And here is an output of line_debugger tool in Jupyter notebook: Initialization (f is a function wrapped around the code):

%load_ext line_profiler
%lprun -f f f(df2, df3)

Output:

Total time: 1.82219 s
File: <ipython-input-8-79f01a353d31>
Function: f at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def f(df2,df3):
     2         1       8093.0   8093.0      0.2      index = df3.Domain.str.split('.').tolist()
     3                                               #iterate through each row of the datafrme and split each domain at the dot
     4       901      11775.0     13.1      0.2      for row in df2.itertuples():
     5                                           
     6       900      26241.0     29.2      0.5        cleandomain = []
     7                                                 #iterate through each of the split domains
     8    810900     971082.0      1.2     18.8        for x in index:
     9                                                   #if it isn't a string, then print the value directly in the cleandomain list
    10    810000    1331253.0      1.6     25.8          if not isinstance(x, str):
    11    810000    2819163.0      3.5     54.6              cleandomain.append(str(x))
    12                                                   #if it's a string that encapsulates numbers, then it's an IP
    13                                                   elif str(x)[-1].isnumeric():
    14                                                       try:
    15                                                           cleandomain.append(str(x[0])+'.'+str(x[1])+'.*.*')
    16                                                       except IndexError:
    17                                                           cleandomain.append(str(x))
    18                                                   #if its in the CDN list, take a subdomain as well
    19                                                   elif len(x) > 3 and str(x[len(x)-2]).rstrip() in cdns:
    20                                                       try:
    21                                                           cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+str(x[len(x)-1]))
    22                                                       except IndexError:
    23                                                           cleandomain.append(str(x))
    24                                                   elif len(x) > 3 and str(x[len(x)-3]).rstrip() in cdns:
    25                                                       try:
    26                                                           cleandomain.append(str(x[len(x)-4])+'.'+str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    27                                                       except IndexError:
    28                                                           cleandomain.append(str(x))
    29                                                   #if its in the TLD list, do this
    30                                                   elif len(x) > 2 and str(x[len(x)-2]).rstrip()+'.'+ str(x[len(x)-1]).rstrip() in tld:
    31                                                       try:
    32                                                           cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    33                                                       except IndexError:
    34                                                           cleandomain.append(str(x))
    35                                                   elif len(x) > 2 and str(x[len(x)-1]) in tld:
    36                                                       try:
    37                                                           cleandomain.append(str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
    38                                                       except IndexError:
    39                                                           cleandomain.append(str(x))
    40                                                   #if its not in the TLD list, do this
    41                                                   else:
    42                                                     cleandomain.append(str(x))

My code:
Data preparation:

from io import StringIO
import pandas as pd
#import file into dataframe
TESTDATA=StringIO("""Domain,Use
      graph.facebook.com,   4242
          news.bbc.co.uk,  23423
news.more.news.bbc.co.uk, 234432
     profile.username.co, 235523
         offers.o2.co.uk, 235523
   subdomain.pyspark.org,   2325
     uds.data.domain.net,  23523
       domain.akamai.net,  23532
         333.333.333.333,3432324
""")
df=pd.read_csv(TESTDATA)
df["Domain"] = df.Domain.str.strip()
df = pd.concat([df]*100)

df2 = df
#extract only 2 columns from dataframe
df3 = df2
#define tld and cdn lookup lists
tld = ['co.uk', 'com', 'org', 'gov.uk', 'co', 'net', 'news', 'it', 'in' 'es', 'tw', 'pe', 'io', 'ca', 'cat', 'com.au',
  'com.ar', 'com.mt', 'com.co', 'ws', 'to', 'es', 'de', 'us', 'br', 'im', 'gr', 'cc', 'cn', 'org.uk', 'me', 'ovh', 'be',
  'tv', 'tech', '..', 'life', 'com.mx', 'pl', 'uk', 'ru', 'cz', 'st', 'info', 'mobi', 'today', 'eu', 'fi', 'jp', 'life',
  '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'earth', 'ninja', 'ie', 'im', 'ai', 'at', 'ch', 'ly', 'market', 'click',
  'fr', 'nl', 'se']
cdns = ['akamai', 'maxcdn', 'cloudflare']

Timing in jupyter notebook:

%%timeit
index = df3.Domain.str.split('.').tolist()
#iterate through each row of the datafrme and split each domain at the dot
for row in df2.itertuples():

  cleandomain = []
  #iterate through each of the split domains
  for x in index:
    #if it isn't a string, then print the value directly in the cleandomain list
    if not isinstance(x, str):
        cleandomain.append(str(x))
    #if it's a string that encapsulates numbers, then it's an IP
    elif str(x)[-1].isnumeric():
        try:
            cleandomain.append(str(x[0])+'.'+str(x[1])+'.*.*')
        except IndexError:
            cleandomain.append(str(x))
    #if its in the CDN list, take a subdomain as well
    elif len(x) > 3 and str(x[len(x)-2]).rstrip() in cdns:
        try:
            cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    elif len(x) > 3 and str(x[len(x)-3]).rstrip() in cdns:
        try:
            cleandomain.append(str(x[len(x)-4])+'.'+str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    #if its in the TLD list, do this
    elif len(x) > 2 and str(x[len(x)-2]).rstrip()+'.'+ str(x[len(x)-1]).rstrip() in tld:
        try:
            cleandomain.append(str(x[len(x)-3])+'.'+str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    elif len(x) > 2 and str(x[len(x)-1]) in tld:
        try:
            cleandomain.append(str(x[len(x)-2])+'.'+ str(x[len(x)-1]))
        except IndexError:
            cleandomain.append(str(x))
    #if its not in the TLD list, do this
    else:
      cleandomain.append(str(x))

You are awesome! That's bought my code execution from 2.9 seconds (on 10k rows) down to 0.3! Thanks! — kikee1222, Oct 22 '18 at 07:31

Vectorization of pandas dataframe to split domains, with multiple if-else clauses

1 Answers1