1

I needed to find a way to get a pointer to a substring (like strstr, first occurence) to more than one possible needles (patterns) a large string. C's standard strstr() does only support one needle, I need 2 needles or even 3. Why all this? I need to be able to "tokenize" a html document into parts to further parse these "snippets". The "anchor" I need for tokenizing can vary, for example <div class="blub"> or <span id="bla and the html tags to be used as an token could contain numbers in the id/class attribute values (there for I could use \d+ or such to filter).

So I thought to write a function in using posix regex.

The function looks like this:

char * reg_strstr(const char *str, const char *pattern) {
    char *result = NULL;
    regex_t re;
    regmatch_t match[REG_MATCH_SIZE];

    if (str == NULL)
        return NULL;

    if (regcomp( &re, pattern, REG_ICASE | REG_EXTENDED) != 0) {
        regfree( &re );         
        return NULL;
    }

    if (!regexec(&re, str, (size_t) REG_MATCH_SIZE, match, 0)) {

        fprintf( stdout, "Match from %2d to %2d: \"%s\"\n",
             match[0].rm_so,
             match[0].rm_eo,
             str + match[0].rm_so);
        fflush(stdout);

        if ((str + match[0].rm_so) != NULL) {
            result = strndup(str + match[0].rm_so, strlen(str + match[0].rm_so));
        }
    }

    regfree( &re );

    return result;
}

The constant REG_MATCH_SIZE is 10

First of all, does that idea using regex as an extended strstr function make sense at all?

In simple test cases that function seem to work fine:

char *str_result = reg_strstr("<tr class=\"i10\"><td><div class=\"xyz\"><!--DDDD-1234--><div class=\"xx21\">", "<div class=\"xyz\">|<div class=\"i10 rr");

printf( "\n\n"
    "reg_strstr result: '%s' ..\n", str_result);

free( str_result) ;

Using that function in a in a real case environment using a complete HTML document does not to work like expected. It does not find the pattern. Using this function on a memory mapped string (I use a mmap'ed file as a cache for tmp. storage while parsing HTML document data).

EDIT:

Here in a loop like used:

Variables: parse_tag->firsttoken and parse_tag->nexttoken are the html anchors I try to match, just like illustrated above. doc is the input document, from the mmap'ed cache an allocated and '\0' terminated string (with strndup()). Code below works with strstr() as expected. If I find out, the idea using regex strstr really work for me I can rewrite the loop and maybe return all matches from reg_strstr (as an stringlist or such). So for now I am just trying ...


...
char *tokfrom = NULL, *tokto = NULL;
char *listend = NULL;

/* first token found ? */ if ((tokfrom = strstr(doc, parse_tag->firsttoken)) != NULL) { /* is skipto_nexttoken set ? */ if (!parse_tag->skipto_nexttoken) tokfrom += strlen(parse_tag->firsttoken); else { /* ignore string between firsttoken and first nexttoken */ if ((tokfrom = strstr(tokfrom, parse_tag->nexttoken)) == NULL) goto end_parse; }

/* no listend tag found ? */
if (parse_tag->listend == NULL ||
    (listend = reg_strstr(tokfrom, parse_tag->listend)) == NULL) {
    listend = doc + strlen(doc);
}

*listend = '\0';        /* truncate */

do {
    if((tokto = reg_strstr(tokfrom + 1, parse_tag->nexttoken)) == NULL)
        tokto = listend;
    tokto--;  /* tokto-- : this token up to nexttoken */

    if (tokto <= tokfrom)
        break;

    /* do some filtering with current token here ... */
    /* ... */

} while ((tokfrom = tokto + 1) < listend);

} ...

EDIT END

Do I miss something here? Like said, is this possible at all what I try to accomplish? Is the regex pattern errornous?

Suggestions are welcome!

Andreas

Andreas W. Wylach
  • 723
  • 2
  • 10
  • 31
  • 2
    Two things come to mind here, this: http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html and this: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Robert Groves May 05 '11 at 04:24
  • @Robert, I think (and hope) the OP is trying to *tokenize* HTML, rather than actually parse it. (If they are trying to parse it, though...) – David X May 05 '11 at 04:33
  • @Robert: I totally agree with you (and others) about "parsing" html with regex. It would be a pain ...". For such problem I use libxml2 (it can be used even on not well formed html). So far so good. But using regex for extracting information and some filtering I think is legitim. – Andreas W. Wylach May 05 '11 at 04:47
  • I would be surprised if libxml2 could parse html, which is not very similar to xml... – R.. GitHub STOP HELPING ICE May 05 '11 at 05:23
  • @R..: libxml2 has an htmlparser module included which implements an HTML 4.0 non-verifying parser. Its pretty similar to the XML one and does its work (at least what I would need). Sorry if I was a bit unclear on my statement:-) – Andreas W. Wylach May 05 '11 at 11:30

3 Answers3

1

I tried you code on a test HTML file that I simply input from a text-file through stdin via redirection, and that seemed to work just fine with repeated reads to fgets(). I would then suspect that the issue is somewhere in the formatting of the string-data in you memory-mapped file. My suspicion is that there is a null-terminating character somewhere in your memory mapped file, such that if you are simply using the memory mapped file itself as a char buffer, it is ending the string far earlier than you were expecting.

Secondly, you are only returning the first match plus the rest of the string, which would mean the entire file from the first match onwards if you're using the pointer to the memory mapped file as your str parameter. I'm suspecting your "real" implementation is a bit different if you're looking to tokenize the file?


EDIT

I've been looking at your concept code, and it does seem to be working overall. I made a couple modifications just to help me print things out, but here is what I'm compiling (very down-and-dirty for the file-memory-mapping just to check if the regex code is working):

#include <regex.h>
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>
#include <fcntl.h>

#define REG_MATCH_SIZE 10
#define FILE_SIZE 60000

static int total_matches = 0;

char * reg_strstr(const char *str, const char *pattern) 
{
    char *result = NULL;
    regex_t re;
    regmatch_t match[REG_MATCH_SIZE];

    if (str == NULL)
        return NULL;

    if (regcomp( &re, pattern, REG_ICASE | REG_EXTENDED) != 0) {
        regfree( &re );         
        return NULL;
    }

    if (!regexec(&re, str, (size_t) REG_MATCH_SIZE, match, 0)) {

        fprintf( stderr, "@@@@@@ Match from %2d to %2d @@@@@@@@@\n",
             match[0].rm_so,
             match[0].rm_eo);

    total_matches++;

        if ((str + match[0].rm_so) != NULL) {
            result = strndup(str + match[0].rm_so, strlen(str + match[0].rm_so));
        }
    }

    regfree( &re );

    return result;
}


int main()
{   
    int filedes = open("testhtml.txt", O_RDONLY);

    void* buffer = mmap(NULL, FILE_SIZE, PROT_READ, MAP_PRIVATE, filedes, 0); 

    char* str_result;
    char* temp_buffer = strdup((char*)buffer);
    while(str_result = reg_strstr(temp_buffer, "<div"))
    {
        char* temp_print = strndup(str_result, 30);
        fprintf(stderr, "reg_strstr result: '%s' ..\n\n", temp_print);
        free(temp_print);
        free(temp_buffer);
        temp_buffer = strdup(str_result + 1);
        free( str_result) ;
    }

    fprintf(stderr, "Total Matches: %d\n", total_matches);

    return 0;
}

Just using the simple match for "<div", if I run it on the entire HTML source for a page like this one here at Bloomberg, I get a total of 87 matches, and I get something that is equivalent to what you would get with a repeated call to the standard strstr(). For instance, sample output looks like (note: I've cut off the match on the return string after 30 characters for sanity's sake):

@@@@@@ Match from 5321 to 5325 @@@@@@@@@
reg_strstr result: '<div id="noir_dialog" class="p' ..

@@@@@@ Match from 362 to 366 @@@@@@@@@
reg_strstr result: '<div id="container" class="mod' ..

The matching indexes change of course since the new input string is shorter than the previous input string, so that's why you see a match that starts at 5321, but then the next match is at 362 ... the overall offset would be at 5683 in the original string. With a different regular expression I'm sure you would get different results, but overall it seems that your concept is working, or at least is working like strstr() would work, that is it's returning the entire string starting at the match to the substring all the way to the end of the string.

If you're not getting the results you're expecting (I'm not sure exactly what you're not getting), then I would say the problem is either in the regular expression itself, or in the loop, in that you may have your indexing off (i.e., with totok-- you could be creating a loop for yourself that simply keeps returning a match at the same point in the string).

Jason
  • 31,834
  • 7
  • 59
  • 78
  • Indeed I did not terminate string from the mmap'ed cache. Now I return an allocated and terminated string with `strndup()' from the cache. First I thought that's the problem, but `reg_strstr()` is still not working like expected. I try to think what else problem could be there. I added the loop as I use the `reg_strstr()` fct in my question, maybe I oversee something there. Like stated above, that loop and how I coded it works find with `strstr()`. – Andreas W. Wylach May 05 '11 at 06:52
  • @Andreas Check my new edited post and see if there is anything there that might help out. I'm not exactly sure what the un-expected output you're getting is, but it seems, at least from my end, that the concept is working. – Jason May 05 '11 at 14:52
  • Thanks for your effort and testing this out. I appreciate this very much. Yes, indeed it works. Even with a or'ed regex match (such like "a|b"). I think in my loop is something wrong with the doc string handling (I just wonder because it does work with the regular strstr() fct). The handling of temp_buffer and fresh allocation for the next call with reg_strstr seems to be a good thing. Your help get me one step further. Once again, thanks! – Andreas W. Wylach May 05 '11 at 15:50
0

Make sure the data you are loading is null-terminated. Arg 2 of regexec must be a null-terminated string.

Robert Groves
  • 7,574
  • 6
  • 38
  • 50
0

Why don't you simply use a finite state machine which processes one character at a time and parses HTML tags. This way, you'll also get rid of the problem of HTML comments. Think of the following cases:

Anchor tag in HTML comment:

<!-- <my anchortag="foo"> -->

Comment in HTML attribute:

<some tag="<!--"> <my anchortag="foo"> </some tag="-->">

With regular expressions, you'll have a hard time dealing with these cases.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. (Jamie Zawinski)

Philip
  • 5,795
  • 3
  • 33
  • 68
  • Thanks for your answer; Like stated above, I just tokenize the HTML (not parsing). My `reg_strstr()` simply is a tokenizing fct that should return a pointer to the pattern matched text (just like `strstr()` does. This is what I try to achieve. I am not implementing a HTML Parser. See my question the code snippet with the loop. The reason why I want a regex-based `strstr()` is to be more fexible setting the needle / pattern. – Andreas W. Wylach May 05 '11 at 08:48