2

So my current homework is to implement a DBMS-like system in C, and I've decided to use the linux POSIX regex matching library. I'm writing toy codes to learn stuff, and this is my testing for the "command parsing" part of the DBMS. The post is a little long, but please bear with me.

So, the full program code, based on this example:

char *toSubstring(const char *str, int start, int end)
{
  char *buffer = malloc((end - start + 1) * sizeof *buffer);
  memcpy(buffer, str + start, end - start);
  *(buffer + (end - start)) = '\0';

  return buffer;
}

int compile_regex(regex_t *r, const char *r_pattern)
{
  int status = regcomp(r, r_pattern, REG_EXTENDED|REG_NEWLINE);

  if (status != 0) {
    char err_msg[MAX_ERR_MSG];
    regerror(status, r, err_msg, MAX_ERR_MSG);
    printf("REGEX error compiling '%s': %s\n", r_pattern, err_msg);

    return status;
  }
  return 0;
}

int match_regex(regex_t *r, const char *r_text, char ***m_array, int n_matches)
{
  int i, nomatch;
  regmatch_t m_osets[n_matches];
  *m_array = malloc(n_matches * sizeof **m_array);

  nomatch = regexec(r, r_text, n_matches, m_osets, 0);

  if (nomatch) {
    printf("No matches\n");

    return nomatch;
  }

  for (i = 0; i < n_matches ; i++) {
    int start = (int) m_osets[i].rm_so;
    int stop = (int) m_osets[i].rm_eo;

    if (start==-1 || stop==-1) {
      *(*(m_array)+i) = NULL;

      printf("WARNING: Match block %d is void!\n", i);
    } else {
      *(*(m_array)+i) = toSubstring(r_text, start, stop);

      printf("Match block %d @bytes %d:%d\n", i, start, stop);
    }
  }
  return 0;
}

void chafree(char **c, int n)
{
  int i;

  for (i = 0; i < n; i++) {
    if (*(c+i)!=NULL)
      free(*(c+i));
  }
  free(c);
}

int main(int argc, char **argv)
{
  int i, m;
  regex_t r;
  const char * r_text = *(argv+1);
  const char * r_pattern = *(argv+2);

  char **matches = NULL;

  if (argc != 4) {
    printf("Usage: ./program_name \"r_text\" \"r_pattern\" n_matches\n");

    exit(1);
  }
  printf("Sweeping '%s' for '%s'\n", r_text, r_pattern);

  compile_regex(&r, r_pattern);
  match_regex(&r, r_text, &matches, atoi(*(argv+3)));

  if (matches != NULL) {
    for(i=0;i<atoi(*(argv+3));i++){

      if(*(matches+i)!=NULL)
        printf("$%d --> %s\n", i, *(matches+i));
      else printf("$%d --> %p\n", i, *(matches+i));
    }
    chafree(matches, atoi(*(argv+3)));
  }
  regfree(&r);

  return 0;

A slight difference from the example I've presented is the fact that I'm storing the matches and capture groups in a string vector.

Now, when I run the program with:

./regex_drills "insert whatever bananana" "([[:alpha:]]+)[[:blank:]]*([^\0]*)" 3

The output I receive is:

Sweeping 'insert whatever bananana' for '([[:alpha:]]+)[[:blank:]]*([^\0]*)'
Match block 0 @bytes 0:23
Match block 1 @bytes 0:5
Match block 2 @bytes 7:23
$& --> insert whatever bananana!
$1 --> insert
$2 --> whatever bananana

The regex pattern seems to be alright, according to regex 101, but notice the stray " ! " at the end of the full expression. Altough the capture groups are correctly parsed, the unusual character happens (so far) on the full expression range and, for the test cases used up until now, only when it is exactly 24 bytes long. It's possibly a very stupid mistake and I'm sorry for it.

Also, any suggestions on how to handle regex in a better and perhaps more elegant way, in C, would be very welcome. Many thanks in advance.

Edit

So, according to responses it was an offset error internal to toSubstring. It's fixed now and the outputs are smooth, as they ought to. I've also cleaned up the code a bit, as suggested in the comments.

A somewhat invasive run using valgrind reveals no errors or undefined behaviours, unlike what was happening previously:

$ valgrind --leak-check=full --track-origins=yes --show-reachable=yes ./regex_drills "insert whatever bananana" "([[:alpha:]]+)[[:blank:]]*([^\0]*)" 3

==7051== Memcheck, a memory error detector
==7051== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==7051== Using Valgrind-3.10.0 and LibVEX; rerun with -h for copyright info
==7051== Command: ./regex_drills insert\ whatever\ bananana ([[:alpha:]]+)[[:blank:]]*([^\\0]*) 3
==7051== 
Sweeping 'insert whatever bananana' for '([[:alpha:]]+)[[:blank:]]*([^\0]*)'
Match block 0 @bytes 0:24
Match block 1 @bytes 0:6
Match block 2 @bytes 7:24
$0 --> insert whatever bananana
$1 --> insert
$2 --> whatever bananana
==7051== 
==7051== HEAP SUMMARY:
==7051==     in use at exit: 0 bytes in 0 blocks
==7051==   total heap usage: 167 allocs, 167 frees, 18,458 bytes allocated
==7051== 
==7051== All heap blocks were freed -- no leaks are possible
==7051== 
==7051== For counts of detected and suppressed errors, rerun with: -v
==7051== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

I'd like to thank everyone for the quick response. I've learned plenty from this.

DVNO
  • 163
  • 2
  • 6
  • 1
    Just a small nitpick: Why are you using pointer arithmetic for the `argv` access? Usually one uses normal array-indexing syntax for that, e.g. `r_text = argv[1]`. Same thing with other pointers. – Some programmer dude Nov 11 '14 at 08:01
  • 1
    And please remove the C++ tag, your code is *not* C++, and a C++ solution would look *very* different, because C and C++ are very different languages. – Some programmer dude Nov 11 '14 at 08:03
  • 1
    Also, [in C you should not cast the result of `malloc`](http://stackoverflow.com/questions/605845/do-i-cast-the-result-of-malloc). And `sizeof(char)` is defined by the C specification to always be `1`. – Some programmer dude Nov 11 '14 at 08:04
  • 1
    And you have [*undefined behavior*](http://en.wikipedia.org/wiki/Undefined_behavior) in your code as you write beyond the limits of allocated memory. In the `toSubstring` function, you allocate `end-start+1` bytes, which means the top index you can access in the resulting memory is `end-start`, yet you write the string terminator at position `end-start+1` which is one byte beyond the end of the allocated memory. – Some programmer dude Nov 11 '14 at 08:09
  • The pointer arithmetic is just a habit, from what I've learned both syntaxes are interpreted the same way by the C preprocessor. The `sizeof(char)` is simply to keep it more explicit, as the code will be read and my writing will be evaluated. Also, according to [cplusplus](http://www.cplusplus.com/reference/cstdlib/malloc/), malloc returns a pointer of the type void*, so why shouldn't I cast it to my desired type ? – DVNO Nov 11 '14 at 08:12
  • 1
    As for not casting `malloc`, please follow the link and read the answers. Regarding the pointer arithmetic it's not handled by the preprocessor, it's also often easier to read (and less to write) using array-indexing. – Some programmer dude Nov 11 '14 at 08:17
  • @JoachimPileborg Indeed it is a silly offset error; Seems I'm copying undefined characters over to my target strings... It's fixed ! Thank you for the quick comments ! – DVNO Nov 11 '14 at 08:19
  • As for the regex, why do you need `[^\0]` ? Is this typical for POSIX in C so you don't run over pointer boundries, or doesn't the regex handle itterator style, or start/end ptr or start ptr + a length? –  Nov 11 '14 at 08:24

1 Answers1

3

Your toSubstring function has an off-by-one error. end is an exclusive bound, so the length of the substring is len = end - start. You must allocate one more than that to store the terminating null character, but you should only copy len chars and, mor importantly, you should write the terminating null to buffer[len], not buffer[len + 1]:

char *toSubstring(const char *str, int start, int end)
{
  char *buffer = malloc(end - start + 1);
  memcpy(buffer, str + start, end - start);
  *(buffer + (end - start)) = '\0';

  return buffer;
}

Maybe you didn't intend end to be exclusive, because you adjust it when you call toSubstring. You can keep these semantics and allocate end - start + 2 chars (and keey the rest of your code), but in C, the upper bound is usually exclusive, so I recommend to use the function as above and to call it like this:

*(*(m_array) + i) = toSubstring(r_text, start, stop);

instead of (..., stop - 1).

M Oehm
  • 28,726
  • 3
  • 31
  • 42