I have a pandas DataFrame with a column holding the hostname of each email address (over 1000 rows):
email hostname
email@example.com example.com
email2@example.com example.com
email@example2.com example2.com
email@example3.com example3.com
I want to go through each hostname and check whether it actually exists, producing something like this:
email hostname valid_hostname
email@example.com example.com True
email2@example.com example.com False
email@example2.com example2.com False
email@example3.com example3.com False
First, I extracted the hostname of each email address:
df['hostname'] = df['email'].str.split('@').str[1]
Then, I tried to check the DNS using pyIsEmail, but that was too slow:
from pyisemail import is_email
df['valid_hostname'] = df['email'].apply(lambda x: is_email(x, check_dns=True))
Then, I tried a multi-threaded function:
import requests
from requests.exceptions import ConnectionError
def validate_hostname_existence(hostname: str):
    try:
        response = requests.get(f'http://{hostname}', timeout=0.5)
    except ConnectionError:
        return False
    else:
        return True
from concurrent.futures import ThreadPoolExecutor
with ThreadPoolExecutor() as executor:
    df['valid_hostname'] = pd.Series(
        executor.map(validate_hostname_existence, df['hostname']),
        index=df['hostname'].index,
    )
That did not go well either, as I'm fairly new to parallel code. It raises several errors, and I believe it would be much more efficient to first check whether a hostname has already been checked and, if so, skip the request for it entirely. I would like to go as far as I can without actually sending an email.
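To make the deduplication idea concrete, here is a minimal sketch of what I have in mind. It assumes that a plain DNS lookup through the standard library's socket.getaddrinfo is an acceptable definition of "exists"; the helper names hostname_resolves and check_unique_hostnames, the checker parameter, and the max_workers value are hypothetical and only there for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def hostname_resolves(hostname: str) -> bool:
    """Return True if the hostname resolves to an address (no HTTP request, no email sent)."""
    try:
        socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the name does not resolve; UnicodeError: malformed hostname
        return False
    return True


def check_unique_hostnames(hostnames: pd.Series, checker=hostname_resolves,
                           max_workers: int = 20) -> pd.Series:
    """Run `checker` once per distinct hostname and map the result back to every row."""
    unique = hostnames.dropna().unique().tolist()
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map preserves input order, so zip pairs each hostname with its result
        results = dict(zip(unique, executor.map(checker, unique)))
    # Rows with a missing hostname stay NaN
    return hostnames.map(results)
```

It would be used as df['valid_hostname'] = check_unique_hostnames(df['hostname']). One caveat I'm aware of: getaddrinfo only looks for address records, so a domain that publishes only MX records would be reported as False.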
Is there a library or another way to accomplish this? I have not been able to find a proper solution to this problem so far.