My requirement:

I have a 20 GB tab-delimited txt file. I want to use Perl/AWK (or grep) to check whether the email address in the nth column is valid. The regex /^(\w|-|_|.)+\@((\w|-|_)+.)+[a-zA-Z]{2,}$/ should be OK, except that consecutive dots or underscores must not be allowed: abc..cd@xyz.com should be invalid, and so should abc__cd@xyz.com. If the email address is valid, write it to valid_email.txt; if it is invalid, write it to invalid_email.txt. The emphasis is on catching all invalid email addresses with good performance, as the file will grow further in the future.

Edit/Update:

Does the code below catch at least 99% of invalid email address formats, or does it need further modification? Kindly feel free to post your opinions and suggestions.

To pull out Valid Email ID

grep -E -o "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+.[A-Za-z]{2,6}\b" Raw_file.txt > Valid_Email_List.txt (where Raw_file.txt contains only email addresses)

Sreenath
  • Those are perfectly valid email addresses. And that regex has many false negatives and false positives. – Ignacio Vazquez-Abrams Dec 17 '15 at 03:39
  • Okay. Do you have a question? No one is going to write your code for you. – Jordan Running Dec 17 '15 at 03:40
  • I will! Just let me know where to send the invoice first... – Ian McGowan Dec 17 '15 at 03:41
  • :) I understand that, Jordan. I am stuck on finding an expression that picks out exactly the valid email addresses, and the regex shouldn't look ugly. I just need a one-line snippet that finds them and redirects them to a txt file. – Sreenath Dec 17 '15 at 03:51
  • There is no good regex for validating email addresses - they are too complicated. You should take a look at [this](http://stackoverflow.com/a/201378/2767207) – Jojodmo Dec 17 '15 at 03:52
  • @Sreenath step 1 is you have to figure out how to ask a question. That would include a description of your task, the specific question(s) you have, what you have tried so far and concise testable sample input and expected output. – Ed Morton Dec 17 '15 at 05:44
  • Use [Email::Valid](http://search.cpan.org/~rjbs/Email-Valid-1.198/lib/Email/Valid.pm). (Disclaimer: I've contributed to the module over the years, but it's a very good one) – stevieb Dec 17 '15 at 14:58

1 Answer

You should not use regex to validate email addresses. For the most part, you don't need to fully validate email addresses by syntax--it isn't useful.

First, accept as valid any address which contains an @ character. This will reject 99% of "random noise."

Then, if you want to know if an address is truly valid, send an email to it! If you get a positive acknowledgement, such as the user clicking a verification link contained in the email, it is valid.

If you do it based on syntax alone, you will accept obviously-bad addresses like nobody@example.com. And you will accept email addresses from providers which have long since gone out of business (making the address "not usable" despite being syntactically "valid").
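As a rough sketch of the "contains an @" check applied to the tab-delimited file from the question, the following awk-based shell function splits rows into the two output files in a single pass. The function name `split_by_at` and its arguments are hypothetical, chosen only for illustration; the output file names are the ones the question asks for.

```shell
# Hypothetical helper (not from the original thread).
# Usage: split_by_at FILE COL
# Rows whose column COL contains "@" go to valid_email.txt,
# all other rows go to invalid_email.txt (both in the current directory).
split_by_at() {
    # Create/truncate both outputs so each exists even if it receives no rows.
    : > valid_email.txt
    : > invalid_email.txt
    awk -F'\t' -v col="$2" '
        # index() is a plain substring search, so no regex is involved here.
        index($col, "@") > 0 { print > "valid_email.txt"; next }
                             { print > "invalid_email.txt" }
    ' "$1"
}
```

For example, `split_by_at Raw_file.txt 3` would test the third column, assuming that is where the address lives. If the extra rule from the question is wanted, a condition such as `$col !~ /\.\.|__/` could be added to the first pattern, though that rejects addresses that may in fact be deliverable.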

John Zwinck
  • John, I have seen people entering abc#cv@xyz.com or abc$cd@xyz.com and want to avoid these, so simply checking for '@' will not work for me. Moreover, it's a huge 20 GB file, so I don't want to try the 'acknowledgement method'. How about performance: is Perl better than grep? Can we use a combination of Perl and AWK? – Sreenath Dec 17 '15 at 04:16
  • Who cares if someone enters `abc#cv@xyz.com`? That email address may not be usable to contact them, but neither will be `abccv@xyz.com`. That is, you're not going to be able to turn "bad" email addresses into "good" ones this way, so why bother? What's the benefit? As for the performance of Perl, it doesn't matter, any of these tools will easily process 20 GB of text in less than an hour so it makes no difference which one you use in terms of performance. A simple `grep @` will be lightning fast and give you 99% of the benefits. – John Zwinck Dec 17 '15 at 04:18
  • No, my idea is not to turn bad email addresses (text) into good ones, but to identify incorrectly formatted email IDs and contact the customer via SMS, asking them to update us with a correctly formatted email address. It doesn't matter whether it is abc@xyz.com or a.b.c@xyz.com. – Sreenath Dec 17 '15 at 04:33
  • OK, so instead of doing that, just wait until you have an actual email you need to send to the user. Send it. If you get a bounce back or other error, then SMS the user to notify them that the email failed. Easy, efficient, doesn't annoy users unnecessarily. – John Zwinck Dec 17 '15 at 04:38
  • `abc$cd@xyz.com` is a perfectly valid address. – tripleee Dec 17 '15 at 05:44