
I am trying to use the PCRE library from C to match the contact links for all US senators on the Senate web page.

To do this, I need the regex to return all 100 matches so that I can print out the web address of each senator's email contact link.

From my research, it looks like the PCRE library is the way to do this, but I don't know how to get multiple matches from a single string.

This is the regex pattern that I am going to use:

Contact:\s+<a\s+(?:[^>]*?\s+)?href=(["'])(.*?)\1

Here is my current code:

#include <stdio.h>
#include <stddef.h>
#include <stdlib.h>
#include <unistd.h>
#include <pcre.h>

int main() {


    // Declare all variables used below
    FILE *file;
    char *buffer;
    long size;

    //Wget on Senate webpage
    system("wget -q http://www.senate.gov/general/contact_information/senators_cfm.cfm");


    // Attempt to open file
    file = fopen("senators_cfm.cfm", "r");

    if(file == NULL){

        printf("Was unable to open file \n");
        return 1;        

    }

    //Seek to the end of the file
    fseek(file, 0L, SEEK_END);



    //Determine the number of bytes that were in the file
    size = ftell(file);

    //Attempt to allocate the number of bytes needed, plus one zeroed byte
    //so the buffer is a NUL-terminated C string after the read
    buffer = (char*) calloc(size + 1, sizeof(char));
    if(buffer == NULL){

        printf("Unable to allocate memory needed \n");
        return 1;
    }


    //Reset the reader to start of file
    rewind(file);


    //Read whole file into buffer
    fread(buffer, sizeof(char), size, file);


    //Close file
    fclose(file);


    //Free all information that we allocated memory for
    free(buffer);

    unlink("senators_cfm.cfm");
    return 0;
}
  • Have you considered using a programming language that would support string as a primitive datatype and possibly even have an HTTP client in the standard library?? – Antti Haapala -- Слава Україні Sep 09 '19 at 18:38
  • It is part of the assignment for this course to use C or C++ (I have not used C++ before), and I want to use regex along with it. – Jeffrey Hennen Sep 09 '19 at 18:41
  • I was kinda liking the approach. If you have the page in `buffer` and just want to scrape all the e-mails and web-pages, why not just step through the buffer with either `strstr` for `"http"` or `strchr` and `'@'` and then bracket each link or e-mail and extract them? – David C. Rankin Sep 09 '19 at 18:42
  • @DavidC.Rankin, sorry, I guess I am not following exactly what you mean. You are suggesting that I do something like a while loop that would continue to try and find an occurrence of say that link's web address to the email till I don't find any? – Jeffrey Hennen Sep 09 '19 at 18:48
  • Notice that if you're targeting a POSIX system you'll have the non-PCRE Regex engine at your disposal straight from the standard library – Antti Haapala -- Слава Україні Sep 09 '19 at 18:50
  • An example at https://stackoverflow.com/questions/36975020/count-number-of-matches-using-regex-h-in-c – Antti Haapala -- Слава Україні Sep 09 '19 at 18:51
  • @AnttiHaapala, So if I used say regex.h which is in the standard library. I would not be able to use my regex though, am I correct in that assumption? – Jeffrey Hennen Sep 09 '19 at 18:58
  • It does need drastic modifications, yes. But the intent is expressible as a basic regular expression. The `.*?` is tricky but you can code it as a branch instead – Antti Haapala -- Слава Україні Sep 09 '19 at 19:02
  • @JeffreyHennen yes. While you can use a regex, if you look at the decade-long debate on just what regex actually matches all valid e-mail addresses, then just choosing the regex is a bit daunting. On the other hand, walking a couple of pointers from the beginning to the end of `buffer` seems straightforward: locate `"http"` and iterate the 2nd pointer to the end of the link and extract it; scan forward for `'@'`, back up to the beginning of the e-mail, scan forward with the 2nd pointer to the end and extract the e-mail; repeat until you run out of buffer. – David C. Rankin Sep 09 '19 at 19:06
  • In any case I suggest you read [this](https://stackoverflow.com/a/1732454/918959) if not for anything else then for fun ;) – Antti Haapala -- Слава Україні Sep 09 '19 at 19:09
  • Look at the PCRE2 [demo program](http://pcre.org/current/doc/html/pcre2demo.html) and other [PCRE2 documentation](http://pcre.org/current/doc/html/). Adapt the demo to your context; it won't be hard and it will be educational (and it's what I'd do if I were going to write code to answer your question). – Jonathan Leffler Sep 09 '19 at 20:59
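
For reference, the `regex.h` suggestion from the comments could look roughly like the sketch below. It is not PCRE and not a tested solution for the Senate page. It assumes a POSIX system and a NUL-terminated buffer, and `print_contact_links` is a made-up helper name. The pattern is a hand translation of the PCRE one into a POSIX extended regular expression: `\s` becomes `[[:space:]]`, and the backreference plus lazy `.*?` are replaced by one branch per quote style. The key idea for getting multiple matches is to call `regexec` in a loop, restarting just past the end of the previous match.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (name is an assumption): prints the href value of
 * every "Contact: <a ... href=...>" occurrence in the NUL-terminated
 * string html. Returns the number of matches, or -1 if the pattern
 * fails to compile. */
int print_contact_links(const char *html)
{
    /* POSIX ERE translation of the PCRE pattern from the question.
     * Group 3 holds a double-quoted URL, group 4 a single-quoted one. */
    const char *pattern =
        "Contact:[[:space:]]+<a[[:space:]]+([^>]*[[:space:]]+)?"
        "href=(\"([^\"]*)\"|'([^']*)')";
    regex_t re;
    regmatch_t m[5];
    const char *cursor = html;
    int count = 0;

    assert(html != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Each successful regexec reports offsets relative to cursor, so
     * advance cursor past the whole match and search again. */
    while (regexec(&re, cursor, 5, m, cursor == html ? 0 : REG_NOTBOL) == 0) {
        int g = (m[3].rm_so != -1) ? 3 : 4;   /* which quote style matched */
        printf("%.*s\n", (int)(m[g].rm_eo - m[g].rm_so), cursor + m[g].rm_so);
        count++;
        cursor += m[0].rm_eo;
    }

    regfree(&re);
    return count;
}
```

In the program from the question this would be called as `print_contact_links(buffer)` after the `fread`, provided the buffer was allocated with one extra zeroed byte so that it is a valid C string. The same restart-after-the-last-match loop is what the PCRE2 demo program does with its own API.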

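
The pointer-walking idea from the comments (find `"http"` with `strstr`, then move a second position to the end of the link) can be sketched without any regex library. This is only an illustration: `print_http_links` is a made-up name, the set of characters treated as the end of a link is an assumption, and the buffer must be NUL-terminated (for example by allocating `size + 1` zeroed bytes before reading the file).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (name is an assumption): prints every substring of
 * buf that starts with "http" and runs up to the next quote, angle
 * bracket, or whitespace character. Returns the number of links printed. */
int print_http_links(const char *buf)
{
    const char *start = buf;
    int count = 0;

    assert(buf != NULL);

    while ((start = strstr(start, "http")) != NULL) {
        /* Length of the run that contains none of the delimiters. */
        size_t len = strcspn(start, "\"'<> \t\r\n");
        printf("%.*s\n", (int)len, start);
        count++;
        start += len;   /* continue scanning after this link */
    }
    return count;
}
```

Unlike the regex, this picks up every `http` link in the page, not only the ones that follow `Contact:`, so the results would still need filtering.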