I am trying to write a simple program that finds all .txt file names in an HTML web page.
I am using C with libcurl (to download the page from the internet) and PCRE to scan the page.

I am using the following pattern: /\w+.txt/g, with the following code:

if(htmlContent == NULL) return;
char pattern[] = "/\\w+.txt/g";
const char *error;
int erroffset, ovector[OVECCOUNT], htmlLength = (int)(sizeof(htmlContent) / sizeof(char));
pcre *re = pcre_compile(pattern,0,&error,&erroffset,NULL);
if (re == NULL) {
    printf("PCRE compilation failed at offset %d: %s\n", erroffset, error);
    return;
}

int rc = pcre_exec(re,NULL,htmlContent,htmlLength,0,0,ovector,OVECCOUNT);
if(rc < 0) {
    pcre_free(re);
    return;
}
if (rc == 0)
{
    rc = OVECCOUNT/3;
    printf("ovector only has room for %d captured substrings\n", rc - 1);
}

int i;
for (i = 0; i < rc; i++)
{
    char *substring_start = htmlContent + ovector[2*i];
    int substring_length = ovector[2*i+1] - ovector[2*i];
    printf("%2d: %.*s\n", i, substring_length, substring_start);
}

I get zero results when running the code (by the way, this code is taken from the curl callback).

Yosi
  • 4
    Just say no to [parsing HTML with REs](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). – Jerry Coffin Jul 24 '11 at 20:43
  • 1
    I don't know about PCRE API, but I don't think you should use any double quoting in your expression (only silly PHP does that), and you also want to escape the dot: `"\\w+\\.txt"` – Qtax Jul 24 '11 at 21:45
  • 1
    Qtax is right: when you use the PCRE library directly, you do *not* use regex delimiters (`/` in this case). PHP requires them, presumably to make its regex syntax look more like Perl's, but it strips them off before passing the string to PCRE. The `g` modifier isn't needed, either. – Alan Moore Jul 24 '11 at 23:59
  • 1
    The length calculated using `sizeof(htmlContent) / sizeof(char)` is probably wrong. If `htmlContent` is a pointer, it'll only be 4 or 8, and dividing by `sizeof(char)` (which is always 1) doesn't change anything. You probably want `strlen(htmlContent)` instead. – unpythonic Jul 25 '11 at 00:37
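
To illustrate the last comment, here is a minimal sketch, assuming `htmlContent` is a `char *` (the question does not show its declaration). The helper names `length_via_sizeof` and `length_via_strlen` are made up for this example.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: what the question's length expression measures
   when htmlContent is a pointer. sizeof sees the pointer itself, not the
   text it points to; dividing by sizeof(char), which is 1, changes nothing. */
size_t length_via_sizeof(const char *htmlContent)
{
    return sizeof htmlContent;   /* same value as sizeof(char *) */
}

/* Hypothetical helper: length of a NUL-terminated buffer. */
size_t length_via_strlen(const char *htmlContent)
{
    return strlen(htmlContent);
}
```

On a typical 64-bit build the first helper returns 8 no matter how long the page is, so `pcre_exec` would only ever scan the first few bytes. Note that `strlen` needs a NUL-terminated buffer; inside a libcurl write callback the data is not NUL-terminated, and the chunk length is `size * nmemb` from the callback's own arguments.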

1 Answer

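The pattern should be `"\\w+\\.txt\\b"`: no `/.../g` delimiters, because the PCRE C API takes the bare pattern, and the dot escaped so that it matches a literal `.` rather than any character. The trailing `\b` stops the pattern from matching `foo.txtbar`.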

CJ Dennis