Linux posix C regexec() not returning all matches

Question

I have the following program that scans a process's memory for string matches. It works in general, but when I run it against an editor (nano in this case) whose memory holds 1193 possible matches (confirmed by dumping the memory and running egrep on the dump), my code prints only 3 matches. Any idea why?

#ifdef TARGET_64
// for 64bit target (see /proc/cpuinfo addr size virtual)
 #define MEM_MAX (1ULL << 48)
#else
 #define MEM_MAX (1ULL << 32)
#endif

#define _LARGEFILE64_SOURCE
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/ptrace.h>
#include <regex.h>

int main(int argc, char **argv)
{
        if (argc < 2) {
                printf("Usage: %s <pid>\n", argv[0]);
                exit(1);
        }

        char buf[128];
        int pid = atoi(argv[1]);
        snprintf(buf, sizeof(buf), "/proc/%d/mem", pid);
        int fd = open(buf, O_RDONLY);
        if (fd == -1) {
                fprintf(stderr, "Error opening mem file: %m\n");
                exit(1);
        }

        int status ,i;
        int cflags = REG_EXTENDED;
        regmatch_t pmatch[1];
        const size_t nmatch=1;
        regex_t reg;
        const char *pattern="([a-zA-Z]{18,20})";
        regcomp(&reg, pattern, cflags);

        long ptret = ptrace(PTRACE_ATTACH, pid, 0, 0);
        if (ptret == -1) {
                fprintf(stderr, "Ptrace failed: %s\n", strerror(errno));
                close(fd);
                exit(1);
        }

        unsigned char page[4096];
        unsigned long long offset = 0;

        while (offset < MEM_MAX) {
                lseek64(fd, offset, SEEK_SET);

                ssize_t ret;
                ret = read(fd, page, sizeof(page));

                if (ret > 0) {
                        status = regexec(&reg, page, nmatch, pmatch, 0);
                        if(status == 0){
                                for (i=pmatch[0].rm_so; i<pmatch[0].rm_eo; ++i) {
                                        putchar(page[i]);
                                }
                                printf("\n");
                        }
                }

                offset += sizeof(page);
        }

        ptrace(PTRACE_DETACH, pid, 0, 0);
        close(fd);
        regfree(&reg);
        return 0;
}

nano with pid 2208, showing [ Read 1193 lines ], where every line is 18 to 20 alphabetic characters:

root ~/coding/proc/regex # ./memregmatch 22008
ABCABCABCABCABCABC
ABCABCABCABCABCABCAC
ABCCBAABCCBAABCCBABA
root ~/coding/proc/regex #

`regexec` stops parsing at `\0` characters so you need to convert embedded `\0` to something else (and terminate each page with a `\0`). Also, your method of parsing one page at a time will miss strings that extend over a page break. — Klas Lindbäck, Feb 05 '14 at 14:29
Should I add a `page[ret] = '\0';` after `if (ret > 0) {`? Or is there a way to parse everything in one pass so I don't miss strings? I already made a `libpcre` version that works, but I would like to make it work with POSIX regex. — bsteo, Feb 05 '14 at 14:48
If you have plenty of time to code and you really want to speed it up, you should do it the way GNU `grep` does it. Creating a jump table for your special case isn't very hard. Handling page boundaries is a bit more work. If you are ok with the licensing you can probably reuse a lot of the actual source code from GNU `grep`. You can find more information on the implementation here: http://stackoverflow.com/questions/12629749/how-does-grep-run-so-fast — Klas Lindbäck, Feb 05 '14 at 14:58
A simpler way would be to use `egrep`. (I'm not sure whether `egrep` works on files larger than 4 GB though.) — Klas Lindbäck, Feb 05 '14 at 14:59
I am certain that a specially made function that only needs to handle one case can be made to run faster than something that has to be able to handle any regular expression. But, using `libpcre` gives shorter development time and is probably fast enough. Such a large file will probably have to be read from disk, so as long as the program can process it as fast as the disk(s) can read it, then you gain nothing from optimizing the code any further. — Klas Lindbäck, Feb 05 '14 at 15:09
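The first comment's point can be sketched with POSIX regex alone. What follows is a minimal sketch, not code from the thread: `print_all_matches` is a hypothetical helper, and it assumes the caller's buffer has one spare byte after the data for the terminator (so `page` would need `sizeof(page) + 1` bytes). Instead of converting embedded `\0` bytes it walks the NUL-free pieces one by one, and inside each piece it calls `regexec()` repeatedly, resuming after the previous match, because a single call reports only the leftmost match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: print every non-empty match of `re` in buf[0..len)
 * and return how many were printed.
 * Assumption: buf has room for len + 1 bytes, since buf[len] is overwritten
 * with a terminating '\0'.
 */
static size_t print_all_matches(const regex_t *re, char *buf, size_t len)
{
    size_t count = 0;
    size_t pos = 0;

    assert(re != NULL && buf != NULL);
    buf[len] = '\0';                 /* regexec() needs a terminated string */

    while (pos < len) {
        /* regexec() only sees the NUL-free piece that starts at buf + pos */
        const char *piece = buf + pos;
        size_t piece_len = strlen(piece);
        size_t off = 0;
        regmatch_t m;

        /* one regexec() call yields one match, so loop until none is left */
        while (off < piece_len &&
               regexec(re, piece + off, 1, &m, off ? REG_NOTBOL : 0) == 0) {
            if (m.rm_eo > m.rm_so) {
                printf("%.*s\n", (int)(m.rm_eo - m.rm_so),
                       piece + off + m.rm_so);
                count++;
            }
            /* resume after the match; step one byte on an empty match */
            off += m.rm_eo > 0 ? (size_t)m.rm_eo : 1;
        }
        pos += piece_len + 1;        /* skip the piece and the NUL after it */
    }
    return count;
}
```

In the question's loop this would be called as `print_all_matches(&reg, (char *)page, (size_t)ret)` in place of the single `regexec()` call. It still looks at one read at a time, so a string split across two reads is not found.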

score 0 · Accepted Answer · answered Feb 05 '14 at 17:01

Ok, did it with libpcre:

#include <pcre.h>
#include <locale.h>

....

        const char *error;
        int   erroffset;
        pcre *re;
        int   rc;
        int   i;
        int   ovector[100];
        char *regex = "([a-zA-Z]{18,20})";
        re = pcre_compile (regex,          /* the pattern */
                        PCRE_MULTILINE|PCRE_DOTALL|PCRE_NEWLINE_ANYCRLF,
                        &error,         /* for error message */
                        &erroffset,     /* for error offset */
                        0);             /* use default character tables */
        if (!re)
        {
                printf("pcre_compile failed (offset: %d), %s\n", erroffset, error);
                return -1;
        }

....

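                if (ret > 0) {
                        /* scan only the bytes actually read, resuming after each match */
                        int start = 0;
                        /* the last pcre_exec argument is the number of ints in ovector, not its size in bytes */
                        while (start < (int)ret && (rc = pcre_exec(re, 0, (const char *)page, (int)ret, start, 0, ovector, sizeof(ovector) / sizeof(ovector[0]))) >= 0)
                        {
                                /* ovector[0]..ovector[1] is the whole match; group 1 is identical for this pattern */
                                printf("%.*s\n", ovector[1] - ovector[0], (const char *)page + ovector[0]);
                                start = ovector[1];
                        }
                }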
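
Neither version handles a string that straddles two 4096-byte reads, which is the second issue raised in the comments. One possible way to cover it with POSIX regex is to carry the unreported tail of each read over to the next one. The sketch below is an illustration under stated assumptions, not a tested drop-in: `struct scanner`, `scanner_init`, `scanner_feed` and `scanner_flush` are hypothetical names, the pattern is fixed to the one from the question, and the logic relies on that pattern never matching more than `MAX_MATCH` bytes and never matching the empty string. Embedded `\0` bytes are replaced by `'\n'`, which the pattern cannot match.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define CHUNK     4096u  /* same size as the `page` buffer in the question */
#define MAX_MATCH 20u    /* assumption: longest string the pattern can match */

/* Hypothetical scanner state; all names here are illustrative. */
struct scanner {
    regex_t re;
    char    buf[MAX_MATCH + CHUNK + 1]; /* carried tail + new chunk + '\0' */
    size_t  carry;                       /* bytes kept from earlier chunks */
    size_t  found;                       /* matches reported so far */
};

static int scanner_init(struct scanner *s)
{
    s->carry = 0;
    s->found = 0;
    return regcomp(&s->re, "[a-zA-Z]{18,20}", REG_EXTENDED);
}

/* Report matches in buf[0..len); keep what may still grow unless `final`. */
static void scan(struct scanner *s, size_t len, int final)
{
    size_t pos = 0;
    regmatch_t m;

    s->buf[len] = '\0';
    while (pos < len && regexec(&s->re, s->buf + pos, 1, &m, 0) == 0) {
        size_t so = pos + (size_t)m.rm_so;
        size_t eo = pos + (size_t)m.rm_eo;

        /* a short match touching the end may continue in the next chunk */
        if (!final && eo == len && eo - so < MAX_MATCH)
            break;
        printf("%.*s\n", (int)(eo - so), s->buf + so);
        s->found++;
        pos = eo;
    }

    /* carry at most MAX_MATCH - 1 bytes, and nothing already reported */
    size_t keep = len > MAX_MATCH - 1 ? len - (MAX_MATCH - 1) : 0;
    if (keep < pos)
        keep = pos;
    s->carry = final ? 0 : len - keep;
    memmove(s->buf, s->buf + keep, s->carry);
}

/* Append one chunk (n <= CHUNK) that directly follows the previous one. */
static void scanner_feed(struct scanner *s, const void *data, size_t n)
{
    assert(n <= CHUNK);
    memcpy(s->buf + s->carry, data, n);
    for (size_t i = s->carry; i < s->carry + n; i++)
        if (s->buf[i] == '\0')
            s->buf[i] = '\n';        /* NUL would end the string early */
    scan(s, s->carry + n, 0);
}

/* Report whatever is still pending, e.g. at the end of a readable region. */
static void scanner_flush(struct scanner *s)
{
    scan(s, s->carry, 1);
}
```

In the read loop, `scanner_feed(&s, page, (size_t)ret)` would replace the per-page matching. Calling `scanner_flush(&s)` whenever `read()` fails, and once after the loop, seems sensible, since the bytes on either side of an unreadable range are not contiguous in the target's memory.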
