I need a way to get a pointer to the first occurrence of a substring in a large string (like strstr), but with more than one possible needle (pattern). C's standard strstr() supports only one needle; I need two or even three.
Why all this? I need to "tokenize" an HTML document into parts so that I can parse these "snippets" further. The "anchor" I need for tokenizing can vary, for example <div class="blub"> or <span id="bla, and the HTML tags used as tokens could contain numbers in the id/class attribute values (for those I could use \d+ or similar to filter).
So I thought I would write a function using POSIX regex.
The function looks like this:
    char * reg_strstr(const char *str, const char *pattern) {
        char *result = NULL;
        regex_t re;
        regmatch_t match[REG_MATCH_SIZE];

        if (str == NULL)
            return NULL;

        if (regcomp( &re, pattern, REG_ICASE | REG_EXTENDED) != 0) {
            regfree( &re );
            return NULL;
        }

        if (!regexec(&re, str, (size_t) REG_MATCH_SIZE, match, 0)) {
            fprintf( stdout, "Match from %2d to %2d: \"%s\"\n",
                     match[0].rm_so,
                     match[0].rm_eo,
                     str + match[0].rm_so);
            fflush(stdout);

            if ((str + match[0].rm_so) != NULL) {
                result = strndup(str + match[0].rm_so, strlen(str + match[0].rm_so));
            }
        }

        regfree( &re );
        return result;
    }
The constant REG_MATCH_SIZE is 10.
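One difference from strstr is worth keeping in mind: the function above returns a heap copy (via strndup) of the text from the match onward, not a pointer into str. For comparison, here is a minimal sketch of a variant that returns a pointer into the original string, as strstr does. The name reg_strstr_ptr and the match_len out-parameter are hypothetical and not part of my code; the digit class is written as [0-9]+ in the usage below because POSIX extended regex does not define \d.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical variant of reg_strstr (illustration only): returns a pointer
 * INTO str at the start of the first match, or NULL if nothing matches.
 * If match_len is not NULL it receives the length of the matched text.
 * The const is cast away on return, mirroring what strstr does. */
char *reg_strstr_ptr(const char *str, const char *pattern, size_t *match_len)
{
    regex_t re;
    regmatch_t match[1];
    char *result = NULL;

    if (str == NULL || pattern == NULL)
        return NULL;

    /* POSIX leaves the contents of re undefined after a failed regcomp,
     * so regfree is not called on this path. */
    if (regcomp(&re, pattern, REG_ICASE | REG_EXTENDED) != 0)
        return NULL;

    if (regexec(&re, str, 1, match, 0) == 0) {
        result = (char *)(str + match[0].rm_so);
        if (match_len != NULL)
            *match_len = (size_t)(match[0].rm_eo - match[0].rm_so);
    }

    regfree(&re);
    return result;
}
```

Called as reg_strstr_ptr(doc, "<div class=\"xx[0-9]+\">", &len), this would return a pointer that can be compared with other pointers into doc, and the caller has nothing to free.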
First of all, does the idea of using regex as an extended strstr function make sense at all?
In simple test cases the function seems to work fine:
    char *str_result = reg_strstr(
        "<tr class=\"i10\"><td><div class=\"xyz\"><!--DDDD-1234--><div class=\"xx21\">",
        "<div class=\"xyz\">|<div class=\"i10 rr");
    printf( "\n\n"
            "reg_strstr result: '%s' ..\n", str_result);
    free( str_result) ;
Using the function in a real-world case with a complete HTML document does not work as expected: it does not find the pattern. There I use the function on a memory-mapped string (I use an mmap'ed file as a cache for temporary storage while parsing HTML document data).
EDIT:
Here is the loop as I use it.
The variables parse_tag->firsttoken and parse_tag->nexttoken are the HTML anchors I try to match, just as illustrated above. doc is the input document, a string allocated from the mmap'ed cache and '\0'-terminated (with strndup()).
The code below works as expected with strstr(). If I find out that the regex strstr idea really works for me, I can rewrite the loop and maybe return all matches from reg_strstr (as a string list or similar). So for now I am just experimenting ...
    ...
    char *tokfrom = NULL, *tokto = NULL;
    char *listend = NULL;

    /* first token found ? */
    if ((tokfrom = strstr(doc, parse_tag->firsttoken)) != NULL) {
        /* is skipto_nexttoken set ? */
        if (!parse_tag->skipto_nexttoken)
            tokfrom += strlen(parse_tag->firsttoken);
        else {
            /* ignore string between firsttoken and first nexttoken */
            if ((tokfrom = strstr(tokfrom, parse_tag->nexttoken)) == NULL)
                goto end_parse;
        }

        /* no listend tag found ? */
        if (parse_tag->listend == NULL ||
            (listend = reg_strstr(tokfrom, parse_tag->listend)) == NULL) {
            listend = doc + strlen(doc);
        }
        *listend = '\0'; /* truncate */

        do {
            if((tokto = reg_strstr(tokfrom + 1, parse_tag->nexttoken)) == NULL)
                tokto = listend;
            tokto--; /* tokto-- : this token up to nexttoken */
            if (tokto <= tokfrom)
                break;
            /* do some filtering with current token here ... */
            /* ... */
        } while ((tokfrom = tokto + 1) < listend);
    }
    ...
EDIT END
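To illustrate the "return all matches" idea mentioned in the edit, here is a rough sketch that collects the start of every non-overlapping match in one pass. The function name reg_find_all, the caller-supplied starts array, and the max limit are all hypothetical; returning pointers into the input instead of copied strings is an assumption made for this sketch, not something my current code does.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (illustration only): stores up to max pointers to the
 * starts of successive non-overlapping matches of pattern in str.
 * Returns the number of matches stored. */
size_t reg_find_all(const char *str, const char *pattern,
                    const char **starts, size_t max)
{
    regex_t re;
    regmatch_t m;
    size_t count = 0;
    const char *p = str;

    if (str == NULL || pattern == NULL || starts == NULL)
        return 0;
    if (regcomp(&re, pattern, REG_ICASE | REG_EXTENDED) != 0)
        return 0;

    while (count < max) {
        /* After the first search p is no longer the start of the text,
         * so REG_NOTBOL keeps '^' from matching there. */
        int eflags = (p == str) ? 0 : REG_NOTBOL;
        if (regexec(&re, p, 1, &m, eflags) != 0)
            break;

        starts[count++] = p + m.rm_so;

        if (m.rm_eo > m.rm_so) {
            p += m.rm_eo;              /* continue right after this match */
        } else if (p[m.rm_so] == '\0') {
            break;                     /* empty match at end of string */
        } else {
            p += m.rm_so + 1;          /* empty match: step one char on */
        }
    }

    regfree(&re);
    return count;
}
```

Compiling the pattern once and reusing it for every search also avoids the repeated regcomp that calling reg_strstr inside the loop incurs.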
Am I missing something here? As I said, is what I am trying to accomplish possible at all? Is the regex pattern erroneous?
Suggestions are welcome!
Andreas