2

I need to include in my C code a POSIX ERE regex compiler/executer. I settled on the native regex.h library with something that looks like the following:

#include <regex.h>

bool
match_posix_regex(const char *pattern, const char *str){
    regex_t regex;
    int reti;

    reti = regcomp(&regex, pattern, REG_EXTENDED);
    if(reti){
        printf("Could not compile the regex\n");
        return false;
    }

    reti = regexec(&regex, str, 0, NULL, 0);
    if(!reti){
        return true;
    }

    else if (reti == REG_NOMATCH){
        return false;
    }

    else{
        printf("ERROR in regex execution\n");
        return false;
    }
}

It came to my attention that this implementation includes support for back-referencing. It is my understanding that POSIX ERE standards do not support back-referencing, however many implementations of these standards do. Looking at the regex.h docs it doesn't seem like I'll be able to disable this feature.

I do not want to include support for back-referencing as it is not included in the standards and furthermore it can result in catastrophic backtracking as described here.

Is there a way I can compile and run a regex in C that is compliant with POSIX ERE standards and does not include back-referencing as a feature?

dsouza
  • 39
  • 5
  • POSIX regex does not allow backreferences and does not cause catastrophic backtracking. – Wiktor Stribiżew Dec 13 '18 at 18:08
  • @WiktorStribiżew I understand that the POSIX ERE standards do not allow back-references, but I am unable to find an implementation in C that does not include this feature. If you run the above code it does indeed compile and understand back-referencing syntax. – dsouza Dec 13 '18 at 18:13
  • There is always the blunt option of pre-processing the regexes before passing them on to the library you are using to filter or escape the non-POSIX features you don't want. Effectively you'd be writing a de-extending wrapper for POSIX-plus regex libraries. However, I would first look around for another regex library that does what you want, and would consider writing one from scratch as the other obvious option. – Tom Dec 13 '18 at 18:18
  • Excellent question. I've somehow assumed after reading `regex(7)` that only BRE would support back references, yet your code proves otherwise! – Antti Haapala -- Слава Україні Dec 13 '18 at 19:34
  • 1
    @Tom I am afraid you might be right with those being my only solutions. I was hoping to avoid the minefield of pre-processing the regexen or writing a library from scratch. – dsouza Dec 13 '18 at 20:16

2 Answers2

2

While ERE regexes are not a regular language (due to parentheses level matching), this aspect does not affect quoting, so it should be possible to write a fairly simple regex such that, if a string matches it, either it's a valid ERE without backreferences or other extensions, or it has mismatched parentheses levels. To do this just treat parentheses as normal characters. Most of the work will be writing the regex for a valid bracket expression. Then, match your input strings against this fixed regex before trying to compile them.

I think the following is a correct regex for bracket expressions, with annotations below the parts:

\[^?]?(\[\.([^.]|\.[^]])+\.]|\[=([^]=]|=[^]])+=]|\[:([^]:]|:[^]])+:]|[^]])*]
       ^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^^^^^ ^^^^
       collating symbol      equivalence class   character class     char

A "pseudo-ERE" is then:

($bracket|[^[\]|\[[.(\)*+?{|^$])*

where $bracket is the above regex for brackets.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • The one hangup I have with rejecting the regexen that use backreferencing is that something that would be interpreted as back-referencing by the current implementation such as `\1` is actually still valid in POSIX ERE and therefore I don't want to reject it outright. It is my understanding that escaped non-special characters are just interpreted as literals (outside of bracket expressions). – dsouza Dec 13 '18 at 20:12
  • @dsouza: No. Per *9.4.2 ERE Ordinary Characters*, "The interpretation of an ordinary character preceded by an unescaped ( '\\' ) is undefined, except in the context of a bracket expression (see ERE Bracket Expression)." If this were not the case, implementations interpreting `\1` as a backreference would be not just harmful but actively broken and nonconforming to the point of being unable to process well-defined EREs. In ERE, only special chars become literals representing themselves when escaped by a backslash. – R.. GitHub STOP HELPING ICE Dec 13 '18 at 20:25
  • Yes, you are right on the undefined interpretation. I think this may be the answer I'm looking for. So just to make sure I understand, I can write a regex that matches strings that include any non-special characters that are preceded by unescaped backslash outside of bracket expressions. These can safely be rejected. I do not quite understand the regex matching strings with mismatched parentheses aspect of your answer. Could you elaborate if possible? – dsouza Dec 13 '18 at 20:50
  • Conceptually, what you want is a function that takes a string and gives you an answer whether it's a valid regex. You could write a parser yourself, but it would be nice if you could just perform a regex match to check it. However, the language of all possible regexes is not regular because it has grouping. What I'm saying is you can just ignore the grouping and make a regex that matches a complete bracket expression, any non-backslash character other than a bracket, or a backslash followed by a special char, and then apply `*` to it. – R.. GitHub STOP HELPING ICE Dec 13 '18 at 21:45
  • If your candidate matches this pattern, it's either a valid ERE or invalid due to mismatched parens. In particular it's not a pseudo-ERE with backreferences. In the second case (false positive with mismatched parens), `regcomp` will report the error to you. – R.. GitHub STOP HELPING ICE Dec 13 '18 at 21:47
  • Hrm. The regex itself might be tricky. Consider that `(.)[[.[.]]\1[[.].]]` contains a backreference, but `(.)[[.].]\1[.[.]]` does not. – alficles Dec 13 '18 at 22:40
  • @alficles: You're not looking for backrefs. You're looking for valid pseudo-EREs where `()` are treated just as abstract specials not grouping. I don't follow your examples. Backslash is not special in brackets but I think you're mistaken about what the brackets are. – R.. GitHub STOP HELPING ICE Dec 14 '18 at 00:09
  • Ah I forgot the stupid collating symbols stuff. Yes they're part of the (difficult as I noted) expression for a complete bracket expression. – R.. GitHub STOP HELPING ICE Dec 14 '18 at 00:14
  • @alficles: I've updated the answer with proposed regexes. – R.. GitHub STOP HELPING ICE Dec 14 '18 at 02:36
1

I just wanted to point out a solution that I found. There does exist a POSIX ERE implementation that does not support backreferencing. You can expose this interface by enabling the _GNU_SOURCE feature test macro. This allows you to compile the regex with RE_SYNTAX_POSIX_MINIMAL_EXTENDED. I found this documented here. You can then use the re_compile_pattern function.

I know that this may cause issues with portability, but I believe this should be understood on most systems and this portability issue was not really a concern in my case.

dsouza
  • 39
  • 5