I'm trying to use regex to parse source files and search for functions in C programs that start with the word "LOG" and may or may not be followed by a second character from the class [1248AFM], which is then followed by an opening parenthesis. This is being developed under Windows using mingw but will ultimately be compiled and run under Linux using gcc.

I'm using the Jan Goyvaerts regex tutorial as a guide, and it seems like what I'm after is either zero or one matches of the bracket expression shown above. Zero or one sounds a lot like the question mark metacharacter, but in my experiments I have yet to get that to work following a bracket expression.

To illustrate what I'm trying to do, I have the short program shown below. Ideally, I would like to have a match on str1 and str2 only. If I compile and run it as shown, I don't get a match on anything. If I leave out the question mark following the bracket expression, I get a match on str2 only, which is what I would expect. In addition to the question mark, I've also tried an interval quantifier of the form {0,1} but had no success with that either. Is there something other than a bracket expression that I should be using?

Dave

#include <stdio.h>
#include <regex.h>

int main(int argc, char **argv) {
  regex_t regex;
  int rtn = regcomp(&regex, "LOG[1248AFM]?(", 0);
  if (rtn) {
    printf("compile failed\n");
    return(1);
  }
  char *str1 = "  LOG(";
  char *str2 = "  LOGM(";
  char *str3 = "  LOG";
  char *str4 = "  LOGJ(";

  int rtn1 = regexec(&regex, str1, 0, NULL, 0);
  int rtn2 = regexec(&regex, str2, 0, NULL, 0);
  int rtn3 = regexec(&regex, str3, 0, NULL, 0);
  int rtn4 = regexec(&regex, str4, 0, NULL, 0);
  printf("str1: %d\nstr2: %d\nstr3: %d\nstr4: %d\n",
    rtn1, rtn2, rtn3, rtn4);

  return(0);
}
David Harper

2 Answers

As Casimir et Hippolyte said: you need to escape the ?, which escaped me when I wrote my comment. The problem is that you are using a C string literal, which means you have to escape the escape: the regex \? must be written as \\? in the source.

EDIT: as user kdhp rightfully noted, \? is a GNU extension to basic regular expressions; the portable form is the interval \{0,1\}. But the problem stays the same: the backslashes themselves need escaping in a C string literal.

#include <stdio.h>
#include <regex.h>

int main(int argc, char **argv) {
  regex_t regex;
  // GNU extension: \? (written \\? in a C string literal)
  // int rtn = regcomp(&regex, "LOG[1248AFM]\\?(",0);
  // Portable POSIX basic regular expression: interval \{0,1\}
  int rtn = regcomp(&regex, "LOG[1248AFM]\\{0,1\\}(",0);
  if (rtn) {
    printf("compile failed\n");
    return(1);
  }
  char *str1 = "  LOG(";
  char *str2 = "  LOGM(";
  char *str3 = "  LOG";
  char *str4 = "  LOGJ(";

  int rtn1 = regexec(&regex, str1, 0, NULL, 0);
  int rtn2 = regexec(&regex, str2, 0, NULL, 0);
  int rtn3 = regexec(&regex, str3, 0, NULL, 0);
  int rtn4 = regexec(&regex, str4, 0, NULL, 0);
  printf("str1: %d\nstr2: %d\nstr3: %d\nstr4: %d\n",
    rtn1, rtn2, rtn3, rtn4);

  regfree(&regex);  // release the compiled pattern
  return(0);
}

Gives

str1: 0
str2: 0
str3: 1
str4: 1
deamentiaemundi
  • Maybe the answer should also explain the fundamental differences between the supported regex dialects, i.e. BRE vs ERE? – tripleee Sep 05 '16 at 20:03
  • @tripleee I'm not a linguist and I don't know manure about dialects. I'm a programmer and so I *do* know that the moment I choose a regular expression to solve a problem I will have two problems instead and in an instant. Furthermore: maximum post length is 30k IIRC – deamentiaemundi Sep 05 '16 at 20:53
  • Use of `\?` is a GNU extension to POSIX BREs. EREs (`REG_EXTENDED`) or interval expressions (`\{0,1\}`) can be used for portability. – kdhp Sep 05 '16 at 22:25
  • @kdhp Argh, and I fell for it! Thanks. Will correct it. – deamentiaemundi Sep 05 '16 at 22:30
  • Do you mean that you don't understand the meaning of the `REG_EXTENDED` flag? As pointed out in several comments, using that would offer a different solution which in some ways would be more elegant, as well as hopefully sort out a misunderstanding in the original question. – tripleee Sep 06 '16 at 07:17
  • @tripleee I was too subtle, it seems ;-) I'm not an expert in regular expressions (although I once wrote a parser some decades ago), just a user who avoids using them whenever possible. I also don't think it is an XY problem, but if you are of a different opinion: feel free to write your own answer. If done well, I won't be the last one to upvote it, either. – deamentiaemundi Sep 06 '16 at 13:48

Part of the problem here stems from an unfortunate confusion between the feature sets of different regular expression dialects.

Long story short, with REG_EXTENDED, you get the grep -E (aka egrep) meaning of some regex constructs.

"e?(grep){3,7}"

where no backslashes are required -- the question mark ? makes the previous expression optional, the parentheses do grouping, and the curly braces express generalized repetition (in this case, between three and seven repetitions).

Without REG_EXTENDED, you get BRE semantics, which require a backslash before each of these. (Strictly speaking, \? is a GNU extension; the portable POSIX BRE spelling of an optional item is \{0,1\}.) In a C string, of course, to produce a literal backslash, you need two backslashes, because the backslash is also the escape character in C string literals.

"e\\?\\(grep\\)\\{3,7\\}"

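Applied to the pattern from the question, the REG_EXTENDED route might look like the sketch below. The helper name has_log_call is made up for illustration and is not part of regex.h. Note that in ERE the opening parenthesis becomes a grouping operator, so here it is the parenthesis, not the question mark, that needs the (doubled) backslash.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (name invented for this sketch).
 * Returns 1 if `line` contains LOG( or LOGx( with x in [1248AFM],
 * 0 if it does not, and -1 if the pattern fails to compile. */
int has_log_call(const char *line) {
  regex_t regex;
  /* ERE: ? is a quantifier without a backslash, while ( must be
   * escaped to be matched literally. REG_NOSUB: no capture info needed. */
  if (regcomp(&regex, "LOG[1248AFM]?\\(", REG_EXTENDED | REG_NOSUB) != 0) {
    return -1;
  }
  int rc = regexec(&regex, line, 0, NULL, 0);
  regfree(&regex);
  return rc == 0 ? 1 : 0;
}
```

Called on the four test strings from the question, this should report a match for str1 and str2 only. The pattern is unanchored, so a longer identifier ending in LOG would match as well.
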
A brief explanation of the history follows, but you could stop reading here and be done.

Basic regular expressions (BRE) are based on the feature set of the original grep by Ken Thompson. The original grep did not have grouping parentheses, the generalized quantification with curly brackets, or even the question mark for expressing optionality. However, the POSIX standard codifies a way to express these constructs even in BRE. Hang on.

Extended regular expressions (ERE) are based on the feature set of egrep, which was an extension of grep written mainly by Al Aho. It introduced a number of new features, as well as a different internal architecture, based on the then-emerging research into the applications of automata theory to string matching (we are talking early to mid 1970s here).

When these were standardized by POSIX, the standard introduced feature parity, but a different surface syntax for these dialects. A somewhat quirky extension of the grep syntax, where backslashes enable, rather than escape, the special meaning of some characters, was introduced in the BRE dialect. This makes BRE backwards compatible with the original grep (as long as you didn't needlessly use backslashes in your regular expressions where previously they had no special meaning), which was an important consideration, but admittedly a design wart.
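
The "backslashes enable" rule also explains why the program in the question matched nothing: in BRE an unescaped ? is an ordinary character. A small sketch to try this out follows; the helper regex_matches is a made-up name for illustration, not a library function.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper (name invented for this sketch).
 * Compiles `pattern` with the given regcomp flags and tests `text`.
 * Returns 1 on match, 0 on no match, -1 if compilation fails. */
int regex_matches(const char *pattern, int cflags, const char *text) {
  regex_t regex;
  if (regcomp(&regex, pattern, cflags | REG_NOSUB) != 0) {
    return -1;
  }
  int rc = regexec(&regex, text, 0, NULL, 0);
  regfree(&regex);
  return rc == 0 ? 1 : 0;
}
```

With cflags set to 0 (BRE), the question's pattern "LOG[1248AFM]?(" does not match "  LOG(", but it does match a string that contains a literal question mark, such as "  LOGM?(". Writing the optional part as "\\{0,1\\}" in BRE, or switching to REG_EXTENDED with "LOG[1248AFM]?\\(", gives the intended behaviour.
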

tripleee
  • didn't expect that you went that deep into history, too ;-) – deamentiaemundi Sep 07 '16 at 19:10
  • Tangentially, also stumbled over this: https://medium.com/@rualthanzauva/grep-was-a-private-command-of-mine-for-quite-a-while-before-i-made-it-public-ken-thompson-a40e24a5ef48#.u10922d6i – tripleee Sep 08 '16 at 03:53