Defining new POSIX-like character class names in C/C++

Question

I'm currently working on a C++ project that uses regexes to express the HTTP RFC rules. RFC 1945, Section 2.2 (Basic Rules) defines the following rules:

CHAR           = <any US-ASCII character (octets 0 - 127)>

CTL            = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
CR             = <US-ASCII CR, carriage return (13)>
LF             = <US-ASCII LF, linefeed (10)>
SP             = <US-ASCII SP, space (32)>
HT             = <US-ASCII HT, horizontal-tab (9)>

CRLF           = CR LF

LWS            = [CRLF] 1*( SP | HT )

word           = token | quoted-string

token          = 1*<any CHAR except CTLs or tspecials>
tspecials      = "(" | ")" | "<" | ">" | "@"
               | "," | ";" | ":" | "\" | <">
               | "/" | "[" | "]" | "?" | "="
               | "{" | "}" | SP | HT

quoted-string  = ( <"> *(qdtext) <"> )
qdtext         = <any CHAR except <"> and CTLs, but including LWS>

What I'm interested in is using character classes like [:digit:], or at least reusing regexes I have already written. The pseudo-code would become something like this (\ is already escaped and each regex is already in string form):

CHAR           = "\\x00-\\x7F"

CTL            = "\\x00-\\x1F\\x7F"
CR             = "\\r"
LF             = "\\n"
SP             = "\\x20"
HT             = "\\t"

//here I start "recycling" old regexes
CRLF           = "[:CR:][:LF:]"

LWS            = "[:CRLF:]* ( [:SP:] | [:HT:] )+"

//here a declaration might happen before using the token or quoted_string class
word           = "[:token:] | [:quoted_string:]"

token          = "( (?= [[:CHAR:]] ) [^[:tspecials:][:CTL:]] )+"
tspecials      = "()<>@,;:\\\\\\"/\\[\\]?={}[:SP:][:HT:]"

quoted_string  = " ( \\" ([:qdtext:])* \\" ) "
//Little trick to allow LWS but not CTLs: https://stackoverflow.com/a/18017758/9373031
qdtext         = "(?=[[:CHAR:]]) ( [:LWS:] | [^\\"[:CTL:]] )"

What I tried so far is storing them as strings and chaining them together with +, but that looked ugly and not very optimized. Of course I could repeat some regexes, but the result grew into an enormous monster the further I went.

I googled for a while, but I found nothing about adding custom POSIX-like classes, nor anything about reusing (and optimizing?) regexes.

What I need is a way to optimize and prettify the source strings of the regexes, so that one regex can be referenced from another as a POSIX-like class or in some other way (code in C/C++):

std::regex CR ("\\r");
std::regex LF ("\\n");

std::regex CRLF ("[:CR:] [:LF:]");

Option 1:
[:CR:] [:LF:] would be expanded to \\r \\n and at compilation would become: std::regex CRLF ("\\r \\n");

Option 2:
[:CR:] [:LF:] would be "expanded" as "two functions" to optimize regex at run-time.

So far I found that std::ctype_base has the static mask constants used for class names in the std::regex_traits<CharT>::lookup_classname function, which is what looks up the defined class names: is it possible to extend the masks used?

It might be interesting in other languages as well, but I am mostly concerned with creating a custom POSIX-like class or something similar in C/C++ specifically. E.g. in JS, as far as I know, it's impossible to use `/[A-Z]/ + /[a-z]/` to join two regexes or even insert one into another. — DadiBit, Oct 27 '20 at 19:18
A character class is not a macro. You could (in theory) define a custom character class containing exactly one character, but there's no way for a POSIX character class to represent anything other than a set of characters, so "a sequence matched by `qdtext`", for example, is way outside of the concept. — rici, Nov 01 '20 at 18:30

score 1 · Accepted Answer · answered Oct 31 '20 at 21:53

You need some kind of metalanguage and a compiler for it. This is not a task for the C++ preprocessor alone, nor for the compiler's constant folding or other compile-stage features.

With the metalanguage you describe your variant of extended RE. Your compiler then parses that and generates input for the main project: either just a set of strings to be fed to the conventional RE engine, or something smarter and more complex.

Tools for your task do exist: http://www.nongnu.org/bnf/, flex/bison, etc. They allow you not only to produce just some set of RE-strings, but to create the whole parser for your metalanguage (you have asked for optimization) - if such a concept is allowed for your project.

Or you can write your own parser from scratch.
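
As a rough illustration of the simplest "set of strings" route, here is a minimal sketch of a hand-written expander. It is only one possible approach under assumptions of my own: the {name} placeholder syntax and the helpers expand, define and http_rules are hypothetical names invented for this example, not part of std::regex or any library. Each known {name} is replaced by the already expanded rule, wrapped in a non-capturing group, so the result is an ordinary ECMAScript pattern. The optional [CRLF] of the RFC is written with ? here. Because of the (?:...) wrapping, this sketch only suits sequence-level rules; rules meant to be spliced inside a bracket expression would need a separate, unwrapped substitution.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

using Rules = std::map<std::string, std::string>;

// Hypothetical helper: copies `pattern`, replacing each {name} that is a
// known rule with that rule's text wrapped in a non-capturing group.
// Anything else in braces (e.g. a quantifier such as {2,3}) is left alone.
std::string expand(const std::string& pattern, const Rules& rules) {
    std::string out;
    std::size_t i = 0;
    while (i < pattern.size()) {
        if (pattern[i] == '{') {
            const std::size_t close = pattern.find('}', i + 1);
            if (close != std::string::npos) {
                const auto it = rules.find(pattern.substr(i + 1, close - i - 1));
                if (it != rules.end()) {
                    out += "(?:" + it->second + ")";
                    i = close + 1;
                    continue;
                }
            }
        }
        out += pattern[i++];
    }
    return out;
}

// Stores a rule in fully expanded form, so later rules can reuse it.
void define(Rules& rules, const std::string& name, const std::string& pattern) {
    assert(rules.find(name) == rules.end() && "rule defined twice");
    rules[name] = expand(pattern, rules);
}

// A few of the RFC 1945 basic rules, written with the {name} notation.
Rules http_rules() {
    Rules rules;
    define(rules, "CR", "\\r");
    define(rules, "LF", "\\n");
    define(rules, "SP", "\\x20");
    define(rules, "HT", "\\t");
    define(rules, "CRLF", "{CR}{LF}");
    define(rules, "LWS", "{CRLF}?(?:{SP}|{HT})+");
    return rules;
}
```

With these definitions, std::regex lws(http_rules().at("LWS")) is compiled from the plain string (?:(?:\r)(?:\n))?(?:(?:\x20)|(?:\t))+, which is what "Option 1" in the question asks for. The expansion happens once at start-up, so it costs nothing per match; the extra groups are redundant but harmless.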

Thank you! It's my first time hearing about BNF parsing libraries :P — DadiBit, Nov 01 '20 at 10:35
