1

I'm trying to search through a text file and find the valid email addresses. I'm doing something like this:

    #!/usr/bin/perl -w

    my $infile = 'emails.txt';

    open IN, "< $infile" or die "Can't open $infile : $!";

    while( <IN> )
    { 
        if ($infile =~ /^[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,6}$/) 
        { 
            print "Valid \n"; 
        } 
    }

    close IN;

But it doesn't do anything. Any help?

Bramble
  • You should read RFC 5322 (http://tools.ietf.org/html/rfc5322) because you are missing valid characters – Benoit Nov 24 '10 at 16:19
  • How are the email addresses embedded in the file? One complete address per line? Scattered among other data? Can there be multiple addresses on one line? Can an email address be broken across multiple lines? – Narveson Nov 24 '10 at 16:57

6 Answers

11

You are matching the email address regex against the name of the file ($infile), not against the lines read from it. In any case, you should not use a regex to validate email addresses; use Email::Valid instead:

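    use strict;
    use warnings;
    use Email::Valid;    # CPAN module, not part of core Perl

    my $infile = 'emails.txt';

    # Three-argument open with a lexical filehandle
    open my $in, '<', $infile or die "Can't open $infile : $!";

    while (my $line = <$in>) {
        chomp $line;    # strip the trailing newline before validating

        # address() returns the address if it is valid, undef otherwise
        if (Email::Valid->address($line)) {
            print "Valid \n";
        }
    }

    close $in;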
jira
  • To expand on why this is the right sort of answer, http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html is the sort of regex you need to actually validate email addresses. – Oesor Nov 24 '10 at 16:14
1

You're trying to match $infile, which contains the name of the text file, i.e. 'emails.txt'.

You should be doing something like:

    while (<IN>) {
        print "Valid \n" if $_ =~ /\bYOURREGEX\b/;
    }

This way, \b matches at word boundaries instead of at the beginning and end of the line, so you can match email addresses contained within a longer string.
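As a rough sketch of matching addresses embedded in longer lines, the snippet below applies the question's pattern without the ^ and $ anchors and with the /i flag. The names `$email_re` and `find_emails` and the sample lines are made up for illustration, and the pattern is a simplification, not an RFC 5322 validator.

```perl
use strict;
use warnings;

# Simplified pattern from the question, unanchored and case-insensitive.
# Assumption: good enough for illustration, NOT a full RFC 5322 validator.
my $email_re = qr/[A-Z0-9._%+-]+\@[A-Z0-9.-]+\.[A-Z]{2,6}/i;

# Return every substring of $line that looks like an address.
sub find_emails {
    my ($line) = @_;
    my @found = $line =~ /($email_re)/g;
    return @found;
}

# Hypothetical sample lines standing in for the contents of emails.txt.
my @sample = (
    'Contact: alice@example.com or bob@example.org',
    'no address on this line',
);

for my $line (@sample) {
    print "Found: $_\n" for find_emails($line);
}
```

In a real script the loop would read from the filehandle and match against `$_` (or a named line variable), never against `$infile`.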

EDIT: But Jira's answer is definitely better, this one just tells you what's wrong.

Hope this helps!

mob
Bharat
1

You'll have problems with this regex unless:

  1. The email address is the only thing in a line of the file
  2. The email address in the file is all caps.

You should replace every A-Z, which accepts only capitals, with \p{Alpha}, which matches alphabetic characters regardless of case. Where A-Z is combined with 0-9 and _, replace the whole group with \w (any word character) instead.

    /^[\w.%+-]+@[\p{Alnum}.-]+\.\p{Alpha}{2,6}$/

This still isn't a fully valid regex for emails (see Benoit's comment), but it might do the job in a pinch.
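To see the case problem in isolation, here is a small sketch that runs the question's caps-only pattern and the \p{Alpha} variant against the same address in upper and lower case. The variable names and the two sample addresses are invented for the demonstration.

```perl
use strict;
use warnings;

# The question's pattern: A-Z accepts capitals only.
my $caps_only = qr/^[A-Z0-9._%+-]+\@[A-Z0-9.-]+\.[A-Z]{2,6}$/;

# The variant from this answer: \w, \p{Alnum} and \p{Alpha} ignore case.
my $any_case = qr/^[\w.%+-]+\@[\p{Alnum}.-]+\.\p{Alpha}{2,6}$/;

# Hypothetical test addresses, one per "line".
for my $addr ('USER@EXAMPLE.COM', 'user@example.com') {
    printf "%-18s caps-only: %-3s any-case: %s\n", $addr,
        ($addr =~ $caps_only ? 'yes' : 'no'),
        ($addr =~ $any_case  ? 'yes' : 'no');
}
```

The lowercase address matches only the second pattern. Adding the /i flag to the original pattern is another way to get case-insensitive matching.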

Community
Axeman
0

I don't know Perl, but your regular expression is anchored to the beginning and end of the entire string. Unless you set a multi-line flag or have only one email address per file, you won't get results.

Try removing the ^ (beginning of string) and $ (end of string) tokens and see if that helps any.

It might help to post a sample of the data as well; without one I can't help you any further.

Frazell Thomas
0

Don't you need something like this?

    @lines = <IN>;
    close IN;

    foreach $line (@lines)
    {
        ...
    }
Vadim
  • This is slurping the file into an array. Sometimes there's a good reason to slurp, but in most cases the best way to read the file is as jira has it. – Narveson Nov 24 '10 at 16:53
0

There is a copy of the regex to validate RFC 5322 email addresses here on SO, you know. It looks like this:

    $rfc5322 = qr{
        # etc
    }x;

It has a thing or two in the # etc elision I’ve made above, which you can check out in the other answer.

By the way, if you’re going to use \b in your regexes, please please be especially careful that you know what it’s touching.

    $boundary_before     =  qr{(?(?=\w)(?<!\w)|(?<=\w))}; # like /\bx/
    $boundary_after      =  qr{(?(?<=\w)(?!\w)|(?=\w))};  # like /x\b/
    $nonboundary_before  =  qr{(?(?=\w)(?<=\w)|(?<!\w))}; # like /\Bx/
    $nonboundary_after   =  qr{(?(?<=\w)(?=\w)|(?!\w))};  # like /x\B/

That’s seldom what people are expecting.
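A small sketch of how this can bite with email-like text: because \b only asserts a transition between a \w character and a non-\w character, a leading `+` in the local part is silently skipped. The pattern `$re`, the helper `first_match`, and the sample strings are hypothetical and deliberately simplified.

```perl
use strict;
use warnings;

# Simplified address pattern wrapped in \b on both sides (illustration only).
my $re = qr/\b[\w.%+-]+\@[\w.-]+\.[A-Za-z]{2,6}\b/;

# Return the first substring matched by $re, or undef if there is none.
sub first_match {
    my ($text) = @_;
    my ($match) = $text =~ /($re)/;
    return $match;
}

# "+" is not a \w character, so there is no \b at the very start of
# '+tag@...'; the first boundary sits between "+" and "t".
for my $text ('+tag@example.com', 'x+tag@example.com') {
    my $match = first_match($text);
    print "$text => ", (defined $match ? $match : '(none)'), "\n";
}
```

The first string yields `tag@example.com` (the `+` is dropped), while the second yields the whole string, since the match can start at the boundary before `x`.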

Community
tchrist