I have a list of about 10,000 email addresses, some of which are incomplete because the data source is unreliable, and I would like to know how I can complete them using Python.

Sample emails:

xyz@gmail.co
xyz@gmail.
xyz@gma
xyz@g

I've tried using the validate_email package to filter out bad emails and have tried various regex patterns, but I end up with results like xyz@gmail.com.co, much like a search and replace in Sublime Text. I think there is a better way to do this than regex and would like to know what it is.

user229044
Nikhil
  • I would be surprised if any library solved that OOTB - this looks like a very localized problem to me. – d33tah May 03 '15 at 11:38
  • How would this code be able to decide whether to complete `xyx@g` to `@gmail` or `@gotmail`? – Lix May 03 '15 at 11:40
  • Okay, I agree. Still, how do you think we can solve this? – Nikhil May 03 '15 at 11:40
  • @Nikhil: you are showing no research effort. Please show us what you tried, or I believe it would be appropriate to flag it for closing as "unclear". – d33tah May 03 '15 at 11:40
  • @Lix g or gm or gmail: everything goes to gmail.com – Nikhil May 03 '15 at 11:41
  • @Nikhil - that is quite an assumption to make, don't you think? Are you certain that the only email domain listed in your DB that starts with `g` is gmail? – Lix May 03 '15 at 11:43
  • @d33tah I am not very learned in this and hence the question here. First using validate_email I filtered out all the invalid emails with the regular pattern which I have mentioned above. Then I tried various regex patterns without any success and hence again the question here. – Nikhil May 03 '15 at 11:47
  • @Nikhil: Hm, that puts it in a different perspective. Please update your question describing what you tried. – d33tah May 03 '15 at 11:59
  • @d33tah alright cool.. thanks!! also give me some ideas.. – Nikhil May 03 '15 at 12:07

2 Answers

2

A strategy to consider is to build a "trie" data structure from the domains that you have, such as gma and gmail.co. Then, where one domain is a prefix of another, you can follow the longer branch of the trie if there is exactly one such branch. In your example, this means gma is ultimately replaced with gmail.co.

There is an answer concerning how to create a trie in Python.
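
The trie idea above can be sketched roughly as follows. This is only a minimal illustration, not a tested solution: the function names (`build_trie`, `complete`, `complete_email`) and the `'$'` end marker are made up for this example, and it assumes you also add a few known-good domains such as gmail.com to the trie, since otherwise the longest completion you can reach is the longest fragment in your own data (gmail.co in the example). It stops at the first fork, so an ambiguous fragment is left unchanged rather than guessed.

```python
def build_trie(domains):
    """Build a nested-dict trie; the key '$' marks the end of a stored domain."""
    root = {}
    for domain in domains:
        node = root
        for ch in domain:
            node = node.setdefault(ch, {})
        node['$'] = True
    return root


def complete(trie, fragment):
    """Extend fragment along the trie while exactly one branch continues.

    Returns the fragment unchanged if it is not in the trie, and stops
    at the first fork, where the completion would be ambiguous.
    """
    node = trie
    for ch in fragment:
        if ch not in node:
            return fragment
        node = node[ch]
    result = fragment
    while True:
        branches = [key for key in node if key != '$']
        if len(branches) != 1:
            break
        result += branches[0]
        node = node[branches[0]]
    return result


def complete_email(trie, email):
    """Complete only the part after '@'; leave strings without '@' alone."""
    local, at, domain = email.partition('@')
    if not at:
        return email
    return local + at + complete(trie, domain)


# Fragments from the data plus hypothetical known-good domains.
trie = build_trie(['g', 'gma', 'gmail.', 'gmail.co', 'gmail.com', 'yahoo.com'])
print(complete_email(trie, 'xyz@gma'))
```

If a second domain starting with g were added to the trie (the gotmail case raised in the comments), `xyz@g` would hit a fork immediately and come back unchanged, which gives you a list of addresses to review by hand.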

Community
minopret
0
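def email_check():
    # Map the first letter after '@' to the provider it is assumed to stand for.
    # This is a guess: every domain starting with 'g' becomes gmail.com, and so on.
    domains = {'g': 'gmail.com', 'y': 'yahoo.com', 'h': 'hotmail.com'}
    with open('/home/cam/Desktop/email.dat') as f, \
            open('/home/cam/Desktop/out.dat', 'w') as fo:  # output file
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            at_pos = line.find('@')
            # First character of the domain, or '' if there is no '@' or nothing after it
            first = line[at_pos + 1:at_pos + 2] if at_pos != -1 else ''
            if first in domains:
                line = line[:at_pos + 1] + domains[first]
            fo.write(line + '\n')

email_check()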
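
If you would rather stay with regex, the xyz@gmail.com.co result from the question usually comes from a pattern that matches only part of the domain, so the replacement gets glued onto whatever is left over. A possible sketch, using the same first-letter rule as the function above: match from the '@' to the end of the string and replace the whole domain in one go. The names `PROVIDERS`, `DOMAIN_RE` and `fix_domain` are hypothetical, and the provider table is an assumption you would have to check against your own data.

```python
import re

# Hypothetical rule table: first letter of the domain -> assumed provider.
PROVIDERS = {'g': 'gmail.com', 'y': 'yahoo.com', 'h': 'hotmail.com'}

# Match '@', capture the first domain letter, then consume the rest of the
# string, so nothing is left over to produce 'gmail.com.co'.
DOMAIN_RE = re.compile(r'@([a-z])[^@\s]*$', re.IGNORECASE)


def fix_domain(email):
    """Replace the whole domain if its first letter is in PROVIDERS."""
    def repl(match):
        provider = PROVIDERS.get(match.group(1).lower())
        # Leave unknown domains exactly as they were.
        return '@' + provider if provider else match.group(0)
    return DOMAIN_RE.sub(repl, email.strip())


for sample in ['xyz@gmail.co', 'xyz@gmail.', 'xyz@gma', 'xyz@g']:
    print(fix_domain(sample))
```

Because the whole domain is replaced, running it twice on the same address gives the same result, so already complete gmail.com addresses are not damaged.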
Ajay