I am using the tld Python library to extract the first-level domain from proxy request logs with an apply function. When tld hits a malformed request that it doesn't know how to handle, such as 'http:1 CON' or 'http:/login.cgi%00', it raises an error like the following:

TldBadUrl: Is not a valid URL http:1 con!
TldBadUrl                                 Traceback (most recent call last)
in engine
----> 1 new_fld_column = request_2['request'].apply(get_fld)

/usr/local/lib/python2.7/site-packages/pandas/core/series.pyc in apply(self, func, convert_dtype, args, **kwds)
   2353             else:
   2354                 values = self.asobject
-> 2355                 mapped = lib.map_infer(values, f, convert=convert_dtype)
   2356 
   2357         if len(mapped) and isinstance(mapped[0], Series):

pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer (pandas/_libs/lib.c:66440)()

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in get_fld(url, fail_silently, fix_protocol, search_public, search_private, **kwargs)
    385         fix_protocol=fix_protocol,
    386         search_public=search_public,
--> 387         search_private=search_private
    388     )
    389 

/home/cdsw/.local/lib/python2.7/site-packages/tld/utils.pyc in process_url(url, fail_silently, fix_protocol, search_public, search_private)
    289             return None, None, parsed_url
    290         else:
--> 291             raise TldBadUrl(url=url)
    292 
    293     domain_parts = domain_name.split('.')

To work around this, it was suggested that I wrap the function in a try-except clause that returns NaN for the rows that error out, so that I can find them afterwards:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        # Mark URLs that tld cannot parse with NaN
        return np.nan

This seems to work for some of the "requests" like "http:1 con" and "http:/login.cgi%00" but then fails for "http://urnt12.knhc..txt/" where I get another error message like the one above:

TldDomainNotFound: Domain urnt12.knhc..txt didn't match any existing TLD name!

This is what the dataframe looks like. It is called "request" and holds 240,000 "requests" in total:

request
  request                                      count
0 https://login.microsoftonline.com            24521
1 https://dt.adsafeprotected.com               11521
2 https://googleads.g.doubleclick.net          6252
3 https://fls-na.amazon.com                    65225
4 https://v10.vortex-win.data.microsoft.com    7852222
5 https://ib.adnxs.com                         12
6 http:1 CON                                   6 
7 http:/login.cgi%00                           45822
8 http://urnt12.knhc..txt/                     1 

My code:

import numpy as np
import pandas as pd
import tld
from tld import get_fld

# Read the CSV back into a dataframe
request = pd.read_csv('Proxy/Proxy_Analytics/Request_Grouped_By_Request_Count_12032018.csv')
# Remove rows with null values in the request column
request = request[pd.notnull(request['request'])]
# Exclude URLs that contain IP addresses from the new dataframe
request = request[~request.request.str.findall(r'[0-9]+(?:\.[0-9]+){3}').astype(bool)]
# Reset the index
request = request.reset_index(drop=True)

def try_get_fld(x):
    try: 
        return get_fld(x)
    except tld.exceptions.TldBadUrl: 
        return np.nan

request['flds'] = request['request'].apply(try_get_fld)

#faulty_url_df = request[request['flds'].isna()]
#print(faulty_url_df)
sectechguy
    I would first try a general exception catcher, like except Exception as e:, to make sure the exception is not something other than the tld.exceptions.TldBadUrl you expect. – Arka Mallick Dec 05 '18 at 14:21

1 Answer

It fails because a different exception is raised. Your except clause expects tld.exceptions.TldBadUrl, but this URL raises tld.exceptions.TldDomainNotFound.

You can either be less specific in your except clause and catch more exceptions with a single except clause, or add another except clause to catch the other type of exception:

import numpy as np
import tld
from tld import get_fld

def try_get_fld(x):
    try:
        return get_fld(x)
    except tld.exceptions.TldBadUrl:
        return np.nan
    except tld.exceptions.TldDomainNotFound:
        print("Domain not found!")
        return np.nan
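
The first option, a single except clause that covers several exception types, could look roughly like the sketch below. The helper make_safe and the stand-in names BadUrl, DomainNotFound and fake_get_fld are hypothetical and exist only so the example runs without tld installed; with the real library you would presumably pass get_fld together with the tuple (tld.exceptions.TldBadUrl, tld.exceptions.TldDomainNotFound).

```python
import math


def make_safe(func, exceptions, default=math.nan):
    """Wrap func so that any exception listed in `exceptions` yields `default`."""
    def wrapper(value):
        try:
            return func(value)
        except exceptions:  # a tuple here matches any of its member classes
            return default
    return wrapper


# Hypothetical stand-ins for the two tld exceptions (not the real classes).
class BadUrl(ValueError):
    pass


class DomainNotFound(ValueError):
    pass


def fake_get_fld(url):
    """Toy stand-in for tld.get_fld; it only mimics the two failure modes."""
    if "://" not in url:
        raise BadUrl(url)
    host = url.split("://", 1)[1].split("/", 1)[0]
    if ".." in host:
        raise DomainNotFound(host)
    return ".".join(host.split(".")[-2:])


# One except clause, two exception types.
safe_fld = make_safe(fake_get_fld, (BadUrl, DomainNotFound))

urls = [
    "https://ib.adnxs.com",
    "http:1 CON",
    "http:/login.cgi%00",
    "http://urnt12.knhc..txt/",
]
results = [safe_fld(u) for u in urls]
```

Applied to the dataframe, this would be request['request'].apply(safe_fld), and the rows that failed can then be selected with request['flds'].isna(). Listing the exception types explicitly keeps unrelated errors visible, which a bare except Exception would hide.
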
Bernhard