
I am using the Cisco Umbrella top 1 million list and found that it includes a lot of subdomains; google.com alone accounts for 2,400+ entries. I want to remove the subdomains from the list and then see how many domains are left in the file.

Is there any bash command that removes such sub-entries? I.e., if the input file contains

google.com
play.google.com 
drive.google.com

the result should be like

google.com
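A plain `sed` substitution cannot easily express "drop a line if a shorter parent line exists", but a two-pass `awk` can. The sketch below assumes one domain per line in a file (the name `domains.txt` is hypothetical) and treats any entry whose dot-suffix also appears in the file as a subdomain; it does not know about public suffixes such as `co.uk`:

```shell
# Create a small sample input (stand-in for the real top-1m file).
printf 'google.com\nplay.google.com\ndrive.google.com\n' > domains.txt

# Pass 1 (NR==FNR): record every domain in the file.
# Pass 2: print a domain only if none of its dot-suffixes
# (i.e., possible parent domains) appear in the file.
awk 'NR==FNR { seen[$0] = 1; next }
     { d = $0
       while ((i = index(d, ".")) > 0) {
         d = substr(d, i + 1)
         if (d in seen) next        # a parent domain exists: skip this subdomain
       }
       print
     }' domains.txt domains.txt
# prints: google.com
```

Piping the result through `wc -l` then answers the "how many domains are left" part.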

Secondly, I tried the following Python code. It takes a very long time because it checks every domain against all 1 million domains:

import csv
import os

# Load the second CSV column (the domain) from each row.
# open() does not expand "~", so expanduser() is needed.
domain_list = []
with open(os.path.expanduser("~/Downloads/1/top-1m.csv"), "r") as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        domain_list.append(row[1])

# For every domain, collect all entries that contain it as a dot-suffix.
# This is O(n^2): each of the ~1M domains is matched against the full list.
multiple_domain = {}
count = 0

for domain in domain_list:
    count += 1
    print(count)
    res = [i for i in domain_list if '.' + domain in i]
    if len(res) > 1:
        multiple_domain[domain] = [len(res)] + res
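The quadratic scan above can be avoided with a set-based suffix check: put all domains in a set, then for each domain test only its own dot-suffixes (a handful of lookups) instead of matching it against the whole list. A sketch, assuming a "subdomain" is any entry whose dot-suffix also appears in the list (no public-suffix handling):

```python
def remove_subdomains(domains):
    """Keep only domains that have no parent domain in the input list."""
    domain_set = set(domains)
    kept = []
    for d in domains:
        parts = d.split(".")
        # Check every proper suffix, e.g. play.google.com -> google.com, com.
        if any(".".join(parts[i:]) in domain_set for i in range(1, len(parts))):
            continue  # a shorter parent exists, so d is a subdomain
        kept.append(d)
    return kept

print(remove_subdomains(["google.com", "play.google.com", "drive.google.com"]))
# -> ['google.com']
```

Set membership tests are O(1) on average, so the whole pass is roughly linear in the number of domains rather than quadratic.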

What can I do?

  • You might be interested in this question: [Extracting top-level and second-level domain from a URL using regex](https://stackoverflow.com/questions/21173734/extracting-top-level-and-second-level-domain-from-a-url-using-regex) – Lescurel Mar 09 '20 at 11:01
  • @Lescurel Let me cross-check. By the way, thanks for your time – Programmer99 Mar 09 '20 at 11:13
  • For the replacement, the `sed` command in bash should do the work (and is already quite fast). For the second part, I strongly advise you to use the Python module called `pandas`, as it is designed to be more user-friendly and faster at computing dataframes than plain Python loops. Is it OK for you to use it? – Jérôme Richard Mar 09 '20 at 11:29
  • @JérômeRichard That totally makes sense; pandas is a great option, I will try that. Secondly, @Lescurel thanks also for your input; it solves 95% of the problem – Programmer99 Mar 09 '20 at 12:04

0 Answers