I am on a memory-limited system, and boost::regex is too large. What options exist to compile my regular expression straight to C/C++, and how many KB of code size should I expect? The goal is to reduce memory and code size as much as possible.

I am looking for under 100 KB of code size and about the same in memory usage. Boost.Regex appears to be approximately 470 KB, which is too large.

dongle26
  • Anyway, `boost::regex` is C++, not C. – netcoder Sep 20 '12 at 02:06
  • How complex do your regular expressions need to be? In the book 'Beautiful Code', there are some simple regular expression functions that probably amount to a couple hundred bytes of code and an amount of stack space mostly controlled by the number of stars (`*`) that appear in the regular expressions. But these are very simple regexes. – Jonathan Leffler Sep 20 '12 at 02:07
  • Regular expressions that will match different parts of the HTTP protocol, so more than just basic `*` and `+` – dongle26 Sep 20 '12 at 02:08
  • 2
    Matches HTTP what? URLs? Requests? Headers? Body? – netcoder Sep 20 '12 at 02:10
  • `(GET|POST|HEAD)[[:blank:]]+(?:([[:alpha:]]{1,6})://([^/[:blank:]]+))?(/[^[:blank:]]*)(?:[[:blank:]]+HTTP/([[:digit:]]{1,3})\\.([[:digit:]]{1,4}))?` for example. – dongle26 Sep 20 '12 at 02:12
  • OK; have you looked at POSIX [`regcomp()`](http://pubs.opengroup.org/onlinepubs/9699919799/functions/regcomp.html) et al? They should handle that, and probably in less than 100 KiB. They handle [BRE and ERE](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html). – Jonathan Leffler Sep 20 '12 at 02:14
  • Is that available with uclibc? Where can I find the library if not? – dongle26 Sep 20 '12 at 02:17
  • Firstly, your compiler or OS may provide a regex library - you could try that. Alternatively, have a look at the Boost Xpressive library. You could even recode your parsing in Boost Spirit if you're not wedded to regular expressions. – Tony Delroy Sep 20 '12 at 02:18
  • Those sound like they increase code size (by a good amount) too. – dongle26 Sep 20 '12 at 02:21
  • How good does your I18N/L10N (G11N) support need to be? Do you need to support Unicode with your `[:alpha:]`, for example? Also, please edit your refined requirements into your question, and then delete the answers. You probably didn't realize what you needed to specify at first; that's why you get questions in the comments, but you can edit your question to put the extra information in there, and we can then clean up the comments (you remove yours; we remove ours). – Jonathan Leffler Sep 20 '12 at 02:24
  • No, ASCII only, because HTTP is ASCII. – dongle26 Sep 20 '12 at 02:24
  • I've got news for you; Unicode-ish domains are coming to the rest of the world any time now... – Jonathan Leffler Sep 20 '12 at 02:26
  • But they are encoded in Punycode (an ASCII escape sequence), are they not? – dongle26 Sep 20 '12 at 02:27
  • Not as I understood it, but I have not investigated in detail so I could easily be wrong. Do you know the relevant RFCs by any chance? – Jonathan Leffler Sep 20 '12 at 02:30
  • No. But the HTTP standard states that header fields are generally ASCII, and I am almost positive that DNS uses Punycode. – dongle26 Sep 20 '12 at 02:34
  • Yet another possibility would be one of [Henry Spencer's regex libraries](http://www.arglist.com/regex). As I recall, the original ("Book") library compiled down to something like 15-20K, back when I paid close attention to things like that (i.e., under MS-DOS). – Jerry Coffin Sep 20 '12 at 02:58

3 Answers


lex (and flex) produce table-driven lexers which are generally pretty small; they go back to the days when 100 kB would have been considered a supercomputer :) The basic flex code skeleton is tiny (a few kB), and the tables depend on how many token types you have and how complicated the regular expressions are, but a simple flex scanner's tables are typically a few kB as well.

However, if you're not using them for building an interpreter/compiler, they do have a couple of annoying characteristics: first, they insist on doing your input and buffering for you, which is nice if you're always reading from a file but can be less cool if your input is coming from a socket or terminal (or, worse, being preprocessed by some kind of translator), and second they are designed for an environment where you have a few simple token types, and you have a parser which is responsible for interpreting the sequencing. (Hence yacc or bison.) You could use these tools to parse HTTP, certainly, and you might even find that you've learned some useful new skills.
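As a rough sketch of what this looks like (the token names `METHOD`, `VERSION`, and `WORD` here are hypothetical, and a real HTTP scanner would need more rules), a minimal flex input file might be:

```lex
/* Sketch only. Generate and build with: flex scanner.l && cc lex.yy.c */
%option noyywrap
%{
enum { METHOD = 256, VERSION, WORD };   /* hypothetical token codes */
%}
%%
"GET"|"POST"|"HEAD"      { return METHOD;  }   /* request method   */
"HTTP/"[0-9]+"."[0-9]+   { return VERSION; }   /* protocol version */
[^ \t\r\n]+              { return WORD;    }   /* any other word   */
[ \t\r\n]+               { /* skip whitespace */ }
%%
```

The calling code then loops over `yylex()` and interprets the token sequence itself, which is the sequencing role a yacc/bison parser would normally play.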

There is a tool called re2c (i.e. regular expression to C) which you might find a little more comfortable. Unlike lex, it produces customized C code, which is quite a bit bulkier, but arguably runs slightly faster. I don't think it's being actively maintained, but I had quite a lot of success with it some years back. You should be able to find it on SourceForge.
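For a feel of re2c's style (a hypothetical fragment, not from the question): the input is ordinary C, and the tool replaces the special comment block with a generated DFA, so this file is compiled only after running it through re2c. `YYCURSOR` and `YYMARKER` are re2c's conventional interface variables:

```c
/* Sketch: run through re2c first, then compile the output with cc. */
static int scan(const char *YYCURSOR)
{
    const char *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable  = 0;

        "GET" | "POST" | "HEAD"  { return 1; }   // request method
        *                        { return 0; }   // anything else
    */
}
```

The generated code is plain `switch`/`goto` state machine logic with no tables and no runtime dependencies, which is where the small footprint comes from.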

Good luck.

rici

People seem to forget that this problem was solved a long time ago by lex and yacc.

Nikolai Fetissov
  • Sort of, with `lex`, but not really. You compile the set of possible regexes with `lex`, but you can't adapt to new regexes at runtime with `lex`. – Jonathan Leffler Sep 20 '12 at 02:19
  • @Jonathan, yes, true, but who said anything about runtime? What's the use of C code at runtime? – Nikolai Fetissov Sep 20 '12 at 02:23
  • So you need to qualify your answer along the lines of: _if the set of regular expressions to be managed is fixed when the program is compiled, then you may be able to use `lex` to recognize those expressions._ – Jonathan Leffler Sep 20 '12 at 02:28
  • I don't want to build a compiler just to parse HTTP. I've looked at lex and yacc and can't figure them out. Also, don't compilers use a lot of memory (relatively speaking)? – dongle26 Sep 20 '12 at 02:30
  • 1
    Hmm, don't think so. The question is titled "convert/compile regular expressions to C code", and then says "What options exist to compile my regular expression straight to C/C++ ...". I don't see any ambiguity here. – Nikolai Fetissov Sep 20 '12 at 02:33
  • @dongle26, I'm sorry if these basic tools are too hard for you, but the code produced by them is tiny and well optimized. – Nikolai Fetissov Sep 20 '12 at 02:34
  • And they parse regular expressions? Do they support partial matches? – dongle26 Sep 20 '12 at 02:41
  • Yes, but you have to look at the whole thing from a different angle - you create a grammar and react to rule matches, instead of calling an API and iterating over the results. – Nikolai Fetissov Sep 20 '12 at 02:58

re2c is an application designed to do exactly that:

http://sourceforge.net/projects/re2c/

(also available as a Debian package, etc.)

Licence: Public Domain

Alternatively, it may be possible to compile a regex to bytecode and link only the interpreter part of pcre2 (or whichever regex flavour you want), e.g.:

https://www.pcre.org/current/doc/html/pcre2api.html#SEC25

It is possible to save compiled patterns on disc or elsewhere, and reload them later, subject to a number of restrictions. The host on which the patterns are reloaded must be running the same version of PCRE2, with the same code unit width, and must also have the same endianness, pointer width, and PCRE2_SIZE type. Before compiled patterns can be saved, they must be converted to a "serialized" form, which in the case of PCRE2 is really just a bytecode dump. The functions whose names begin with pcre2_serialize_ are used for converting to and from the serialized form. They are described in the pcre2serialize documentation. Note that PCRE2 serialization does not convert compiled patterns to an abstract format like Java or .NET serialization.

So, to ship a precompiled regex for PCRE2, you may need to run the compile step on the target system or under emulation.
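A minimal sketch of the encode side (requires linking against libpcre2-8, and as the quoted docs warn, the resulting blob is tied to the PCRE2 version, endianness, and pointer width of the machine that produced it):

```c
#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>
#include <stdio.h>

int main(void)
{
    int errcode;
    PCRE2_SIZE erroff;
    pcre2_code *re = pcre2_compile((PCRE2_SPTR)"(GET|POST|HEAD) ",
                                   PCRE2_ZERO_TERMINATED, 0,
                                   &errcode, &erroff, NULL);
    if (re == NULL)
        return 1;

    /* Serialize the compiled pattern to a byte blob. On the target you
       would ship this blob and call pcre2_serialize_decode() instead of
       pcre2_compile(), so the pattern-compiler code is never needed. */
    uint8_t *bytes;
    PCRE2_SIZE size;
    if (pcre2_serialize_encode((const pcre2_code **)&re, 1,
                               &bytes, &size, NULL) < 0)
        return 1;
    printf("serialized %zu bytes\n", (size_t)size);

    pcre2_serialize_free(bytes);
    pcre2_code_free(re);
    return 0;
}
```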

Jasen