I'm currently working on a C++ project that expresses the HTTP RFC rules as regular expressions. RFC 1945, Section 2.2 (Basic Rules), defines the following rules:
CHAR = <any US-ASCII character (octets 0 - 127)>
CTL = <any US-ASCII control character (octets 0 - 31) and DEL (127)>
CR = <US-ASCII CR, carriage return (13)>
LF = <US-ASCII LF, linefeed (10)>
SP = <US-ASCII SP, space (32)>
HT = <US-ASCII HT, horizontal-tab (9)>
CRLF = CR LF
LWS = [CRLF] 1*( SP | HT )
word = token | quoted-string
token = 1*<any CHAR except CTLs or tspecials>
tspecials = "(" | ")" | "<" | ">" | "@"
| "," | ";" | ":" | "\" | <">
| "/" | "[" | "]" | "?" | "="
| "{" | "}" | SP | HT
quoted-string = ( <"> *(qdtext) <"> )
qdtext = <any CHAR except <"> and CTLs, but including LWS>
What I'm interested in is using character classes like [:digit:], or at least recycling the regexes. The pseudo-code would become something like this (\ is already escaped and the regexes are already in string form):
CHAR = "\\x00-\\x7F"
CTL = "\\x00-\\x1F\\x7F"
CR = "\\r"
LF = "\\n"
SP = "\\x20"
HT = "\\t"
//here I start "recycling" old regexes
CRLF = "[:CR:][:LF:]"
LWS = "[:CRLF:]? ( [:SP:] | [:HT:] )+"
//here a declaration might happen before using the token or quoted_string class
word = "[:token:] | [:quoted_string:]"
token = "( (?= [[:CHAR:]] ) [^[:tspecials:][:CTL:]] )+"
tspecials = "()<>@,;:\\\\\\"/\\[\\]?={}[:SP:][:HT:]"
quoted_string = " ( \\" ([:qdtext:])* \\" ) "
//Little trick to allow LWS but not CTLs: https://stackoverflow.com/a/18017758/9373031
qdtext = "(?=[[:CHAR:]]) ( [:LWS:] | [^\\"[:CTL:]] )"
What I tried so far is to store them as strings and then chain them together with +, but that looked ugly and not very optimized. Of course I could repeat some regexes, but it became an enormous monster the further I went.
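For reference, the chaining approach can be sketched roughly as follows. The namespace rfc1945, the constant names and the helpers is_token and is_quoted_string are illustrative names, not part of any library. Since std::regex has no free-spacing mode, the fragments contain no decorative spaces, and groups are written as (?:...) so that a quantifier applies to the whole fragment.

```cpp
#include <regex>
#include <string>

// Illustrative names only; nothing here comes from a library.
namespace rfc1945 {
// Bodies of bracket expressions: only meaningful between '[' and ']'.
const std::string CHAR      = R"re(\x00-\x7F)re";
const std::string CTL       = R"re(\x00-\x1F\x7F)re";
const std::string SP        = R"re(\x20)re";
const std::string HT        = R"re(\t)re";
const std::string TSPECIALS = R"re(()<>@,;:\\"/\[\]?={})re" + SP + HT;

// Complete sub-patterns built by "recycling" the pieces above.
const std::string CR   = R"re(\r)re";
const std::string LF   = R"re(\n)re";
const std::string CRLF = CR + LF;
const std::string LWS  = "(?:" + CRLF + ")?[" + SP + HT + "]+";
// The lookahead restricts the negated class to US-ASCII characters.
const std::string TOKEN =
    "(?:(?=[" + CHAR + "])[^" + CTL + TSPECIALS + "])+";
const std::string QDTEXT =
    "(?:" + LWS + "|(?=[" + CHAR + "])[^\"" + CTL + "])";
const std::string QUOTED_STRING = "\"" + QDTEXT + "*\"";
}  // namespace rfc1945

// Each regex is compiled once, on first use.
bool is_token(const std::string& s) {
    static const std::regex re(rfc1945::TOKEN);
    return std::regex_match(s, re);
}

bool is_quoted_string(const std::string& s) {
    static const std::regex re(rfc1945::QUOTED_STRING);
    return std::regex_match(s, re);
}
```

With these definitions, is_token("Content-Type") should hold, while is_token("text/html") should not, because "/" is one of the tspecials.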
I googled for a while, but I found nothing about adding custom POSIX-like classes, nor anything about recycling (and optimizing?) regexes.
What I need is a way to optimize and prettify the strings the regexes originate from, so that they can be parsed into a new regex as POSIX-like classes or in some other way (code in C/C++):
std::regex CR ("\\r");
std::regex LF ("\\n");
std::regex CRLF ("[:CR:] [:LF:]");
Option 1: [:CR:] [:LF:] would be expanded to \\r \\n and at compilation would become: std::regex CRLF ("\\r \\n");
Option 2: [:CR:] [:LF:] would be "expanded" as "two functions" to optimize the regex at run-time.
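To make Option 1 more concrete, here is a minimal sketch of the kind of expansion I have in mind. It is plain string substitution done at run time, before the std::regex is constructed, not a real extension of the regex grammar. The function expand, the alias Fragments and the depth limit of 16 are hypothetical choices for illustration. Names that are not in the table, such as a genuine [:digit:], are left untouched.

```cpp
#include <cstddef>
#include <map>
#include <regex>
#include <stdexcept>
#include <string>

// Hypothetical table type: placeholder name -> regex fragment.
using Fragments = std::map<std::string, std::string>;

// Replace every [:NAME:] whose NAME is a key of `defs` by its definition,
// expanding nested placeholders recursively.
std::string expand(const std::string& pattern, const Fragments& defs,
                   int depth = 0) {
    if (depth > 16)  // arbitrary limit to catch cyclic definitions
        throw std::runtime_error("fragments are cyclic or nested too deeply");

    static const std::regex placeholder(R"(\[:([A-Za-z_]+):\])");
    std::string out;
    std::size_t last = 0;
    for (std::sregex_iterator it(pattern.begin(), pattern.end(), placeholder),
                              end;
         it != end; ++it) {
        const std::smatch& m = *it;
        const std::size_t pos = static_cast<std::size_t>(m.position(0));
        out.append(pattern, last, pos - last);  // text before the placeholder
        const auto def = defs.find(m[1].str());
        if (def != defs.end())
            out += expand(def->second, defs, depth + 1);
        else
            out += m.str(0);  // unknown name: keep it verbatim
        last = pos + static_cast<std::size_t>(m.length(0));
    }
    out.append(pattern, last, std::string::npos);  // trailing text
    return out;
}
```

With a table containing CR = "\\r", LF = "\\n" and CRLF = "[:CR:][:LF:]", the call expand("[:CRLF:]", defs) yields the string \r\n (as regex source), which can then be passed to the std::regex constructor. A fragment that contains an alternation would need to wrap itself in (?:...) to survive being pasted next to other fragments.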
So far I found that std::ctype_base defines the static mask constants used for class names by the std::regex_traits<CharT>::lookup_classname function, which should be used for looking up defined class names: is it possible to extend the masks used?
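For context, this is a small sketch of how the lookup behaves today. The helper is_known_class is a hypothetical name. It only queries the default traits and does not extend anything: a name that is not recognized comes back as a value-initialized char_class_type.

```cpp
#include <regex>
#include <string>

// Returns true if the default traits recognize `name` as a character class.
bool is_known_class(const std::string& name) {
    using Traits = std::regex_traits<char>;
    const Traits traits;
    const Traits::char_class_type mask =
        traits.lookup_classname(name.begin(), name.end());
    // An unrecognized name yields a value-initialized char_class_type.
    return !(mask == Traits::char_class_type());
}
```

Standard names such as "digit" or "alpha" are recognized, whereas my own names such as "CRLF" are not, which is why [:CRLF:] cannot be used directly inside a bracket expression.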