Brian Kernighan wrote a short article, "A Regular Expression Matcher", about a matcher that Rob Pike wrote as a demonstration program for a book they were working on, The Practice of Programming (TPOP). The article is a very nice read that explains the code and a bit about regular expressions in general.
I have played with this code, making a few changes to experiment with some extensions, such as also returning where in the string the pattern matches, so that the matching substring can be copied from the original text.
From the article:
I suggested to Rob that we needed to find the smallest regular
expression package that would illustrate the basic ideas while still
recognizing a useful and non-trivial class of patterns. Ideally, the
code would fit on a single page.
Rob disappeared into his office, and at least as I remember it now,
appeared again in no more than an hour or two with the 30 lines of C
code that subsequently appeared in Chapter 9 of TPOP. That code
implements a regular expression matcher that handles these constructs:
    c    matches any literal character c
    .    matches any single character
    ^    matches the beginning of the input string
    $    matches the end of the input string
    *    matches zero or more occurrences of the previous character
This is quite a useful class; in my own experience of using regular
expressions on a day-to-day basis, it easily accounts for 95 percent
of all instances. In many situations, solving the right problem is a
big step on the road to a beautiful program. Rob deserves great credit
for choosing so wisely, from among a wide set of options, a very small
yet important, well-defined and extensible set of features.
Rob's implementation itself is a superb example of beautiful code:
compact, elegant, efficient, and useful. It's one of the best examples
of recursion that I have ever seen, and it shows the power of C
pointers. Although at the time we were most interested in conveying
the important role of a good notation in making a program easier to
use and perhaps easier to write as well, the regular expression code
has also been an excellent way to illustrate algorithms, data
structures, testing, performance enhancement, and other important
topics.
The actual C source code from the article is very nice:
/* match: search for regexp anywhere in text */
int match(char *regexp, char *text)
{
    if (regexp[0] == '^')
        return matchhere(regexp+1, text);
    do {    /* must look even if string is empty */
        if (matchhere(regexp, text))
            return 1;
    } while (*text++ != '\0');
    return 0;
}

/* matchhere: search for regexp at beginning of text */
int matchhere(char *regexp, char *text)
{
    if (regexp[0] == '\0')
        return 1;
    if (regexp[1] == '*')
        return matchstar(regexp[0], regexp+2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0';
    if (*text!='\0' && (regexp[0]=='.' || regexp[0]==*text))
        return matchhere(regexp+1, text+1);
    return 0;
}

/* matchstar: search for c*regexp at beginning of text */
int matchstar(int c, char *regexp, char *text)
{
    do {    /* a * matches zero or more instances */
        if (matchhere(regexp, text))
            return 1;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return 0;
}
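The extension mentioned at the top, reporting where in the string the pattern matches, might look roughly like the sketch below. It is only one possible approach, not code from the article: the names `match_span`, `matchhere_end`, and `matchstar_end` are hypothetical, and the idea is simply to have each helper return a pointer to the end of the match (or `NULL`) instead of 1 or 0.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not from the article): each returns a pointer to
   one past the last matched character, or NULL if there is no match. */
static const char *matchhere_end(const char *regexp, const char *text);

/* matchstar_end: search for c*regexp at beginning of text */
static const char *matchstar_end(int c, const char *regexp, const char *text)
{
    do {    /* a * matches zero or more instances, shortest first */
        const char *end = matchhere_end(regexp, text);
        if (end != NULL)
            return end;
    } while (*text != '\0' && (*text++ == c || c == '.'));
    return NULL;
}

/* matchhere_end: search for regexp at beginning of text */
static const char *matchhere_end(const char *regexp, const char *text)
{
    if (regexp[0] == '\0')
        return text;            /* pattern exhausted: match ends here */
    if (regexp[1] == '*')
        return matchstar_end(regexp[0], regexp + 2, text);
    if (regexp[0] == '$' && regexp[1] == '\0')
        return *text == '\0' ? text : NULL;
    if (*text != '\0' && (regexp[0] == '.' || regexp[0] == *text))
        return matchhere_end(regexp + 1, text + 1);
    return NULL;
}

/* match_span: like match, but on success also stores the offset and
   length of the matched substring. Returns 1 on a match, 0 otherwise. */
int match_span(const char *regexp, const char *text,
               size_t *start, size_t *len)
{
    const char *p = text;
    const char *end;

    assert(start != NULL && len != NULL);

    if (regexp[0] == '^') {     /* anchored: only try offset 0 */
        end = matchhere_end(regexp + 1, text);
        if (end == NULL)
            return 0;
        *start = 0;
        *len = (size_t)(end - text);
        return 1;
    }
    do {    /* must look even if string is empty */
        end = matchhere_end(regexp, p);
        if (end != NULL) {
            *start = (size_t)(p - text);
            *len = (size_t)(end - p);
            return 1;
        }
    } while (*p++ != '\0');
    return 0;
}
```

With this sketch, `match_span("b.*d", "abcd", &start, &len)` reports `start` 1 and `len` 3, so `printf("%.*s\n", (int)len, text + start)` prints `bcd`. Because the star loop tries the shortest repetition first, just as in the original `matchstar`, the span is the leftmost match but not necessarily the longest one: `a*` against `baaa` matches the empty string at offset 0.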