I have a pandas DataFrame with a column holding the hostname of each email address (over 1000 rows):

email               hostname
email@example.com   example.com
email2@example.com  example.com
email@example2.com  example2.com
email@example3.com  example3.com

I want to go through each hostname and check whether it truly exists or not.

email               hostname      valid_hostname
email@example.com   example.com   True
email2@example.com  example.com   False
email@example2.com  example2.com  False
email@example3.com  example3.com  False

First, I extracted the hostname of each email address:

df['hostname'] = df['email'].str.split('@').str[1]

Then, I tried to check the DNS using pyIsEmail, but that was too slow:

from pyisemail import is_email    
df['valid_hostname'] = df['hostname'].apply(lambda x: is_email(x, check_dns=True))

Then, I tried a multi-threaded function:

import requests
from requests.exceptions import ConnectionError

def validate_hostname_existence(hostname:str):
    try:
        response = requests.get(f'http://{hostname}', timeout=0.5)
    except ConnectionError:
        return False
    else:
        return True

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as executor:
    df['valid_hostname'] = pd.Series(executor.map(validate_hostname_existence, df['hostname']), index=df['hostname'].index)

But that did not go well either, as I'm pretty new to parallel code. It raises multiple errors, and I believe it would be much more efficient if I could first check whether a hostname has already been checked and skip the repeated request. I would like to go as far as I can without actually sending an email.

Is there a library or a way to accomplish this? I could not find a proper solution to this problem so far.

Erelephant
  • Well, sending a GET request verifies that they have a website, not that they have email set up, so I'd ditch that approach. You can have a website and no email, and vice versa. How much validation do you want to do? 1. Syntax: this email could be valid. 2. DNS: this domain has email enabled. 3. Address: this particular address at this domain accepts mail. For 3 you really just have to send an email. – nlta Nov 22 '21 at 08:13
  • I appreciate your clarification request, @nlta. I would like to go as far as I can without actually sending an email. – Erelephant Nov 22 '21 at 08:25
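For the DNS level described in the comment above, a minimal sketch using only the standard library might look like the following. The function name `hostname_resolves` is hypothetical. Note the assumption: `socket.getaddrinfo` only tells you whether the name resolves to an address (A/AAAA records); it does not look up MX records, so it is a weaker signal than "this domain has email enabled". An MX lookup would need a third-party DNS library.

```python
import socket


def hostname_resolves(hostname: str) -> bool:
    """Return True if the name resolves to at least one address.

    This checks address resolution only, not MX records.
    """
    try:
        socket.getaddrinfo(hostname, None)
    except (socket.gaierror, UnicodeError):
        # gaierror: the lookup failed (unknown or malformed name).
        # UnicodeError: the name could not be encoded as a hostname.
        return False
    return True


print(hostname_resolves("a..b"))  # malformed name, so the lookup fails
```

Unlike the `requests.get` attempt, this does not require the host to run a web server, and a failed lookup is reported as `False` rather than raising.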

2 Answers

You answered it yourself: you can use a cache to keep the hostnames you have already checked in memory.

For example:

from functools import lru_cache
from pyisemail import is_email

@lru_cache(maxsize=None)  # the keyword is maxsize; None means unbounded
def my_is_email(x, check_dns=True):
    return is_email(x, check_dns=check_dns)

It is also recommended to limit the cache size to keep memory use bounded, for example:

@lru_cache(maxsize=256)

For more information, read This
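Another way to get the same "check each hostname only once" effect is to de-duplicate before checking and map the results back afterwards. The sketch below is only an illustration: `check_unique` and `fake_check` are hypothetical names, and `fake_check` stands in for whatever real check you use so that the example runs without network access.

```python
from typing import Callable, Dict, Iterable


def check_unique(hostnames: Iterable[str],
                 check: Callable[[str], bool]) -> Dict[str, bool]:
    """Run `check` once per distinct hostname and return hostname -> result."""
    results: Dict[str, bool] = {}
    for host in hostnames:
        if host not in results:  # skip hostnames that were already checked
            results[host] = check(host)
    return results


calls = []


def fake_check(host: str) -> bool:
    # Hypothetical stand-in for a real DNS check; records each call.
    calls.append(host)
    return host != "example3.com"


hostnames = ["example.com", "example.com", "example2.com", "example3.com"]
results = check_unique(hostnames, fake_check)
print(results)
print(len(calls))  # one call per distinct hostname
```

With a DataFrame, the resulting dictionary can be mapped back onto every row with `df['valid_hostname'] = df['hostname'].map(results)`.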

  • That did not help with my long list (pandas Series) of domains to check (over 1000), as it still takes way too long (around 5 minutes and counting...) to get the result. – Erelephant Nov 22 '21 at 13:44

I don't know anything about pandas, but here's how you can process a list of emails in parallel and get back a set of valid emails. I'm sure you can adapt this to your pandas case.

from queue import Empty, Queue
from threading import Thread, Lock
from pyisemail import is_email

q = Queue()
lock = Lock()
valid = set()

emails = ["test@google.com", "test2@gmail.com"]
for e in emails:
    q.put(e)


def process_queue(queue: Queue):
    while True:
        try:
            email = queue.get(block=False)
        except Empty:
            break
        if is_email(email, check_dns=True):
            with lock:  # released automatically, even if add() raises
                valid.add(email)


NUM_THREADS = 30
threads = []

for i in range(NUM_THREADS):
    thread = Thread(target=process_queue, args=(q,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

print("done")
print(valid)

Explanation

  1. Create a queue object filled with emails
  2. Create NUM_THREADS threads.
  3. Each thread pulls from the queue. If it gets an email, it processes it, acquires the lock protecting the result set, adds the email to the set, and releases the lock. If there are no emails left, the thread terminates.
  4. Wait for all threads to terminate.
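The same steps can also be sketched with `concurrent.futures.ThreadPoolExecutor`, which handles the queue and the joining for you. This is an alternative under stated assumptions, not the code above: `check_all`, `safe_check`, and `fake_check` are hypothetical names, and `fake_check` replaces `is_email(email, check_dns=True)` so the example runs offline. One design choice worth noting: in the queue version, an exception raised inside a worker ends that thread and the item it was holding is never recorded, which is one possible (unconfirmed) reason for getting fewer results than inputs. Here every input gets a result, and an exception counts as `False`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def check_all(items: Iterable[str],
              check: Callable[[str], bool],
              max_workers: int = 30) -> Dict[str, bool]:
    """Return item -> bool for every distinct item; an exception counts as False."""
    unique_items = list(dict.fromkeys(items))  # de-duplicate, keep order

    def safe_check(item: str) -> bool:
        try:
            return bool(check(item))
        except Exception:
            # A failing lookup must not kill the worker or drop the item.
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, so zip pairs them up.
        return dict(zip(unique_items, executor.map(safe_check, unique_items)))


def fake_check(email: str) -> bool:
    # Hypothetical stand-in for is_email(email, check_dns=True).
    if email == "boom@example.com":
        raise RuntimeError("simulated lookup failure")
    return email.endswith("@google.com")


emails = ["test@google.com", "test2@gmail.com", "boom@example.com", "test@google.com"]
results = check_all(emails, fake_check)
print(results)
```

Because the result is a dictionary keyed by input, it is easy to confirm that no address was silently skipped by comparing `len(results)` with the number of distinct inputs.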
nlta
  • Thank you very much for your help, and I apologize for the late response. This solution partially worked because when I printed off a list of over 10K emails, it came back with only 833 of them (I made sure to adjust your code and print both valid and invalid emails). – Erelephant Dec 07 '21 at 15:01
  • @Erelephant great, if the problem is solved, please accept the solution. If not, let us know what needs to change. – nlta Dec 08 '21 at 00:28
  • While I was unable to debug it and find the cause, using this code snippet with an additional set() named 'invalid' did not return a result for ALL of the email addresses. – Erelephant Dec 08 '21 at 06:43
  • link your code? or include it in the post? – nlta Dec 08 '21 at 07:06