TokenizerFunction is a functor that has two methods, neither of which should be very difficult to implement. The first is reset, which is meant to reset any state the functor might have, and the other is operator(), which takes three parameters. The first two are iterators, and the third is the resulting token.
The algorithm below is simple. First, we skip any spaces. We expect the first non-space character to be one of three kinds. If it's a quotation mark or left parenthesis, then we search until we find the corresponding closing delimiter and return what we find as the token, taking care that quotation marks are supposed to be stripped, but parentheses, apparently, are to remain. If no closing delimiter is found, we report failure. If the first character is something else, then we read up to the next space, quotation mark, or left parenthesis (or the end of the input) and return that instead.
#include <algorithm> // std::find
#include <string>

template <
    typename Iter = std::string::const_iterator,
    typename Type = std::string
>
struct QuoteParenTokenizer
{
    // Nothing to reset: the functor keeps no state between calls.
    void reset() { }

    // Reads one token from [next, end) into tok and advances next past it.
    // Returns false when there is nothing left or a token is unterminated.
    bool operator()(Iter& next, Iter end, Type& tok) const
    {
        while (next != end && *next == ' ')
            ++next;
        if (next == end)
            return false; // nothing left to read
        switch (*next) {
            case '"': {
                ++next; // skip token start
                Iter const quote = std::find(next, end, '"');
                if (quote == end)
                    return false; // unterminated token
                tok.assign(next, quote); // quotation marks are stripped
                next = quote;
                ++next; // skip the closing quotation mark
                break;
            }
            case '(': {
                Iter paren = std::find(next, end, ')');
                if (paren == end)
                    return false; // unterminated token
                ++paren; // include the parenthesis
                tok.assign(next, paren);
                next = paren;
                break;
            }
            default: {
                // A bare word ends at a space or at the start of a
                // quoted or parenthesized token.
                Iter const first = next;
                while (next != end && *next != ' ' && *next != '"' && *next != '(')
                    ++next;
                tok.assign(first, next);
            }
        }
        return true;
    }
};
You'd instantiate it as tokenizer<QuoteParenTokenizer<> >. If you have a different iterator type, or a different token type, you'll need to indicate them in the template parameters to both tokenizer and QuoteParenTokenizer.
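If you want to poke at a functor like this without pulling in the tokenizer class at all, you can drive it by hand. The sketch below is my own illustration, not part of any library: collect_tokens is a hypothetical helper that calls reset once and then operator() until it returns false, which is roughly how I'd expect a tokenizer's iterator to use it. SpaceSplitter is a stand-in functor included only so the block compiles on its own.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: drives any TokenizerFunction-style functor by hand
// and collects the tokens it produces.
template <typename TokFunc>
std::vector<std::string> collect_tokens(TokFunc f, const std::string& input)
{
    std::vector<std::string> out;
    std::string::const_iterator next = input.begin();
    std::string::const_iterator const end = input.end();
    std::string tok;
    f.reset();                  // clear any state before starting
    while (f(next, end, tok))   // false means no more tokens (or an error)
        out.push_back(tok);
    return out;
}

// Stand-in functor (an assumption for illustration): splits on spaces only.
struct SpaceSplitter
{
    void reset() { }
    bool operator()(std::string::const_iterator& next,
                    std::string::const_iterator end,
                    std::string& tok) const
    {
        while (next != end && *next == ' ')
            ++next;
        if (next == end)
            return false;       // nothing left to read
        std::string::const_iterator const first = next;
        while (next != end && *next != ' ')
            ++next;
        tok.assign(first, next);
        return true;
    }
};
```

Passing QuoteParenTokenizer<>() in place of SpaceSplitter() should work the same way, since both expose the same reset and operator() shape with the default iterator and token types.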
You can get fancier if you need to handle escaped delimiter characters. Things will be trickier if you need parenthesized expressions to nest.
Beware that as of right now, the above code has not been tested.
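For the escaping and nesting cases mentioned above, here is a rough sketch of two helpers you could call from the '"' and '(' branches. Both names (read_escaped_quote and find_matching_paren) are hypothetical, and the escape rule is an assumption on my part: a backslash makes the next character literal. Adjust it to whatever your input format actually uses.

```cpp
#include <string>

// Hypothetical helper: `open` must point at a '('. Returns an iterator to
// the matching ')' (counting nested pairs), or `end` if it is unbalanced.
template <typename Iter>
Iter find_matching_paren(Iter open, Iter end)
{
    int depth = 0;
    for (Iter it = open; it != end; ++it) {
        if (*it == '(') {
            ++depth;
        } else if (*it == ')') {
            if (--depth == 0)
                return it;
        }
    }
    return end;
}

// Hypothetical helper: `next` points just past an opening '"'. Copies the
// quoted text into tok, treating a backslash as an escape for the character
// that follows it (an assumed convention). On success, leaves `next` just
// past the closing quotation mark; on failure, `next` is left unchanged.
template <typename Iter>
bool read_escaped_quote(Iter& next, Iter end, std::string& tok)
{
    tok.clear();
    for (Iter it = next; it != end; ++it) {
        if (*it == '\\') {
            if (++it == end)
                return false;   // dangling backslash
            tok += *it;         // take the escaped character literally
        } else if (*it == '"') {
            next = ++it;        // step past the closing quotation mark
            return true;
        } else {
            tok += *it;
        }
    }
    return false;               // unterminated token
}
```

In the '(' branch you would replace the std::find call with find_matching_paren(next, end), and in the '"' branch you would call read_escaped_quote after skipping the opening quotation mark. Note that the second helper builds a std::string rather than assigning an iterator range, because the token no longer matches a contiguous slice of the input once escapes are removed.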