I have a problem with my function, which tries to find e-mail addresses in a buffer. I can't figure out what is wrong with it.

static int contains_mail(const unsigned char *buffer, int length, int detmode)
{
    const char *reg_exp = "([A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z0-9._%+-]+)";

    regex_t regex;
    int reti;
    regmatch_t matches[2];

    int start0, end0, start1, end1;

    reti = regcomp(&regex, reg_exp, REG_EXTENDED);

    if(reti){ fprintf(stderr, "Could not compile regex\n"); exit(1); }

    reti = regexec(&regex, buffer, 2, matches, 0);

    start0 = matches[0].rm_so;
    end0 = matches[0].rm_eo;
    start1 = matches[1].rm_so;
    end1 = matches[1].rm_eo;

    printf("start0: %d", start0);
    printf("end0: %d", end0);
    printf("start1: %d", start1);
    printf("end1: %d", end1);

    if( !reti ){
        //printf("1");
        return 1;
    } else {
        //printf("0");
        return 0;
    }
}

Example input file:

dfo gpdf eriowepower riwope d@b.pl rwepoir weporsdfi dsfdfasdas@sdfaasdas.pl OSIDQOPWIEPOQWIE sdfs@asdsa.pl
WERO IWUEOIRU OWIERU WOIER asdas@asdasd.pl
aposidasop aposdi aspod iaspodi aspoid aspodi sdfsddfsd@asdasd.pl
werowerowe

The output starts with:

start0: 28end0: 28start1: 1end1: 8

but then it doesn't seem to know where the e-mail address ends, so I cannot calculate it.

Deduplicator
user3612491
  • Your ad-hoc regex disallows a number of permitted characters in the localpart. Kudos for at least allowing `+`, though. – tripleee May 07 '14 at 14:11
  • Welcome to SO. Please read [How to Ask](http://stackoverflow.com/questions/how-to-ask) and [help center](http://stackoverflow.com/help) on how to ask a question. Tagging with the proper categories not only enables syntax-highlighting, proper tagging also makes it much more likely someone will find and answer your question. Please refer to the excerpts when tagging, if in doubt even the full tag-wikis. Retagged that for you. – Deduplicator May 07 '14 at 14:13
  • possible duplicate of [Using a regular expression to validate an email address](http://stackoverflow.com/questions/201323/using-a-regular-expression-to-validate-an-email-address) – Deduplicator May 07 '14 at 14:17

2 Answers

A quick question: how are you passing in the input file? If I define the input as a string and call your function like this:

char string[] = "dfo gpdf eriowepower riwope d@b.pl rwepoir weporsdfi dsfdfasdas@sdfaasdas.pl OSIDQOPWIEPOQWIE sdfs@asdsa.pl\n\
WERO IWUEOIRU OWIERU WOIER asdas@asdasd.pl\n\
aposidasop aposdi aspod iaspodi aspoid aspodi sdfsddfsd@asdasd.pl\n\
werowerowe\n";

contains_mail(string, 0, 0);

and modify your contains_mail function to call regexec repeatedly, as follows:

reti = regexec(&regex, buffer, 2, matches, 0);
while (reti == 0) {
    /* offsets are relative to the current value of buffer */
    start0 = matches[0].rm_so;
    end0 = matches[0].rm_eo;
    start1 = matches[1].rm_so;
    end1 = matches[1].rm_eo;

    printf("start0: %d ", start0);
    printf("end0: %d\n", end0);
    printf("start1: %d ", start1);
    printf("end1: %d\n", end1);
    printf("email: %.*s\n", end1 - start1, buffer + start1);

    /* continue searching right after this match */
    buffer += end1;
    reti = regexec(&regex, buffer, 2, matches, REG_NOTBOL);
}

I get all the matches:

$ ./email_regex
start0: 28 end0: 34
start1: 28 end1: 34
email: d@b.pl
start0: 19 end0: 42
start1: 19 end1: 42
email: dsfdfasdas@sdfaasdas.pl
start0: 18 end0: 31
start1: 18 end1: 31
email: sdfs@asdsa.pl
start0: 28 end0: 43
start1: 28 end1: 43
email: asdas@asdasd.pl
start0: 47 end0: 66
start1: 47 end1: 66
email: sdfsddfsd@asdasd.pl
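
Note that the offsets printed above do not increase from match to match. This is because rm_so and rm_eo are measured from the pointer passed to regexec, and that pointer moves after every match. If absolute positions in the original buffer are needed, one option is to keep a running base offset. The following is only a minimal sketch under that assumption; find_mail_offsets and its parameters are hypothetical names, not part of the question's code:

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: records the absolute [start, end) offsets of up to
 * `max` matches in `text` and returns how many were found. */
static size_t find_mail_offsets(const char *text, size_t *starts,
                                size_t *ends, size_t max)
{
    /* Same pattern as in the question, without the outer parentheses. */
    const char *pattern =
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z0-9._%+-]+";
    regex_t regex;
    regmatch_t m;
    size_t base = 0;   /* offset of the current search start within text */
    size_t count = 0;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0)
        return 0;

    while (count < max &&
           regexec(&regex, text + base, 1, &m, base ? REG_NOTBOL : 0) == 0) {
        /* m holds offsets relative to text + base; convert to absolute. */
        starts[count] = base + (size_t)m.rm_so;
        ends[count] = base + (size_t)m.rm_eo;
        count++;
        base += (size_t)m.rm_eo;   /* resume right after this match */
    }

    regfree(&regex);
    return count;
}
```

With the first line of the example input, the first address would then be reported at 28 to 34, as in the output above, and later addresses at positions counted from the start of the string rather than from the previous match.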

I agree with the other comments that your regex might not be the best for matching e-mail addresses. But what are you actually trying to do?

Timothy Brown

The function regexec finds at most one occurrence of your regex per call. The first match (index 0) contains the start and end positions of the whole match; the following entries contain the positions of the subexpressions in parentheses. (The parentheses around the whole expression in your example serve no purpose, but they produce the same positions for matches 0 and 1, perhaps misleading you into thinking that the same expression is parsed over and over.)
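
To illustrate how the match array relates to parenthesized subexpressions, here is a minimal sketch that regroups the question's pattern into two groups, one for the local part and one for the domain. The helper name split_first_mail is hypothetical and chosen only for this illustration:

```c
#include <regex.h>

/* Hypothetical helper: finds the first address in `text`.
 * On success returns 0 and fills m[0] (whole match), m[1] (local part)
 * and m[2] (domain); returns -1 if nothing matches or compilation fails. */
static int split_first_mail(const char *text, regmatch_t m[3])
{
    /* Group 1 is the part before '@', group 2 the part after it. */
    const char *pattern =
        "([A-Za-z0-9._%+-]+)@([A-Za-z0-9.-]+\\.[A-Za-z0-9._%+-]+)";
    regex_t regex;
    int rc;

    if (regcomp(&regex, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Ask for three entries: the whole match plus two subexpressions. */
    rc = regexec(&regex, text, 3, m, 0);
    regfree(&regex);

    return rc == 0 ? 0 : -1;
}
```

For the text "riwope d@b.pl rwepoir", m[0] would cover d@b.pl, m[1] would cover d, and m[2] would cover b.pl. Each piece can be printed with printf("%.*s", (int)(m[i].rm_eo - m[i].rm_so), text + m[i].rm_so).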

You can look for your expression in a while loop, advancing the pointer after each successful match in order to look for more e-mail addresses. The following modification of your code prints all e-mail addresses found and returns the number of matches.

static int contains_mail(const char *buffer, int length)
{
    const char *reg_exp =
        "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z0-9._%+-]+";

    regex_t regex;
    regmatch_t match;
    int count = 0;

    /* regcomp returns 0 on success and a nonzero error code on failure */
    if (regcomp(&regex, reg_exp, REG_EXTENDED) != 0) {
        fprintf(stderr, "Could not compile regex\n");
        exit(1);
    }

    /* offsets in match are relative to the current buffer pointer */
    while (regexec(&regex, buffer, 1, &match, 0) == 0) {
        int start = match.rm_so;
        int end = match.rm_eo;

        printf("%.*s\n", end - start, buffer + start);
        count++;
        buffer = buffer + end;    /* continue after this match */
    }

    regfree(&regex);    /* release memory allocated by regcomp */
    return count;
}
M Oehm