I am using the Cisco Umbrella top 1 million list. I found that it includes a lot of subdomains; google.com alone accounts for more than 2,400 entries. I want to remove the subdomains from the list and see how many domains are left in the file.
Is there a bash command that removes entries which are subdomains of another entry? For example, if the input file contains
google.com
play.google.com
drive.google.com
the result should be
google.com
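To make the intended behaviour concrete, here is a minimal Python sketch of it, not a bash command. The function name `remove_subdomains` is made up for illustration. It assumes the input is a plain list of domain strings, and it deliberately ignores single-label parents such as `com`, on the assumption that the list may contain bare TLDs that would otherwise swallow every entry.

```python
def remove_subdomains(domains):
    """Keep only names that have no parent domain in the same list."""
    seen = set(domains)  # set membership tests are O(1) on average
    kept = []
    for name in domains:
        labels = name.split(".")
        # Candidate parents: every proper suffix with at least two labels,
        # e.g. "a.b.google.com" -> "b.google.com", "google.com".
        # Single-label suffixes such as "com" are skipped (assumption).
        parents = (".".join(labels[i:]) for i in range(1, len(labels) - 1))
        if not any(p in seen for p in parents):
            kept.append(name)
    return kept


print(remove_subdomains(["google.com", "play.google.com", "drive.google.com"]))
# prints ['google.com']
```

Each name is checked only against its own handful of parent suffixes rather than against the whole list, so `len(remove_subdomains(domain_list))` would give the number of remaining domains.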
Secondly, I tried the following Python code. It takes a very long time because it checks every domain against all 1 million domains:
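import csv
import os

domain_list = []
# open() does not expand "~", so expand it explicitly
with open(os.path.expanduser("~/Downloads/1/top-1m.csv"), "r") as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        # column 0 is the rank, column 1 is the domain name
        domain_list.append(row[1])

multiple_domain = dict()
count = 0
for domain in domain_list:
    count = count + 1
    print(count)
    # Full scan of the list for every domain: O(n^2) overall, which is the slow part.
    # endswith() avoids false matches such as "google.com.example.net".
    res = [i for i in domain_list if i.endswith('.' + domain)]
    if len(res) > 1:
        result = []
        result.append(len(res))
        result.extend(res)
        multiple_domain[domain] = result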
What can I do?
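For reference, one possible way to avoid the full scan per domain is sketched below. It is only a sketch under the same assumptions as the loop above: `domains` is the list read from the CSV, and the helper name `group_subdomains` is hypothetical. Instead of comparing each domain against every other entry, it looks up each name's parent suffixes in a set. Single-label parents such as `com` are skipped, which is an assumption about what should count as a parent.

```python
from collections import defaultdict


def group_subdomains(domains):
    """Map each listed parent domain to the listed names beneath it."""
    seen = set(domains)
    groups = defaultdict(list)
    for name in domains:
        labels = name.split(".")
        # Walk the proper suffixes that have at least two labels.
        for i in range(1, len(labels) - 1):
            parent = ".".join(labels[i:])
            if parent in seen:
                groups[parent].append(name)
    return dict(groups)


sample = ["google.com", "play.google.com", "drive.google.com", "example.org"]
for parent, subs in group_subdomains(sample).items():
    print(parent, len(subs), subs)
```

The work per name depends on its number of labels, not on the size of the list, so the whole pass is roughly linear in the number of entries.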