
I have a very long list of emails that I would like to process to:

  1. separate good emails from bad emails, and
  2. remove duplicates but keep all the non-duplicates in the same order.

This is what I have so far:

from django.core.validators import email_re  # validator regex, per the note below

email_list = ["joe@example.com", "invalid_email", ...]
email_set = set()
bad_emails = []
good_emails = []
dups = False
for email in email_list:
    if email in email_set:
        dups = True
        continue
    email_set.add(email)
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)

I would like this chunk of code to be as fast as possible and, of lesser importance, to minimize memory requirements. Is there a way to improve this in Python? Maybe using list comprehensions or iterators?

EDIT: Sorry! Forgot to mention that this is Python 2.5, since this is for GAE.

email_re is from django.core.validators

new name

2 Answers


Look at Does Python have an ordered set?, and select an implementation you like.

So just:

email_list = OrderedSet(["joe@example.com", "invalid_email", ...])

bad_emails = [] 
good_emails = []

for email in email_list:
    if email_re.match(email):
        good_emails.append(email)
    else:
        bad_emails.append(email)

That's probably the fastest and simplest solution you can achieve.
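If you'd rather not pull in a dependency, the core of what most OrderedSet recipes do internally is just a plain set used as a membership filter. A minimal sketch (`ordered_unique` is a hypothetical helper name, not from any library):

```python
def ordered_unique(items):
    """Return the items with duplicates removed, preserving first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:   # first occurrence wins
            seen.add(item)
            result.append(item)
    return result

print(ordered_unique(["a@x.com", "b@x.com", "a@x.com"]))
```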

utdemir

I can't think of any way to speed up what you have. It's fast to use a set to keep track of what you've seen, and it's fast to append to a list.

I like the OrderedSet solution, but I doubt a Python implementation of OrderedSet would be faster than what you wrote.

You could use an OrderedDict to solve this problem, but it was only added in Python 2.7. You could backport it with a recipe (like http://code.activestate.com/recipes/576693/), but again I don't think it would be any faster than what you have.
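For completeness, on 2.7 (or with the recipe backported) the dedup collapses to one line, since `fromkeys` keeps the first occurrence of each key in insertion order. A sketch with made-up data:

```python
from collections import OrderedDict  # stdlib in 2.7+; use the recipe on 2.5

email_list = ["joe@example.com", "sue@example.com", "joe@example.com"]
# fromkeys keeps one entry per key, in first-seen order
unique_emails = list(OrderedDict.fromkeys(email_list))
print(unique_emails)
```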

I'm trying to think of a Python module that is implemented in C to solve this problem. I think that's the only hope of beating your code. But I haven't thought of anything.

If you can get rid of the dups flag, it will be faster simply by running less Python code.
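One way to drop the flag from the loop body is to derive it after the fact by comparing lengths, since any duplicate makes the set smaller than the list. A sketch with made-up data:

```python
email_list = ["joe@example.com", "sue@example.com", "joe@example.com"]
email_set = set(email_list)
# True if and only if the list contained at least one duplicate
dups = len(email_set) < len(email_list)
print(dups)
```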

Interesting question. Good luck.

steveha