I forget regular expressions faster than my mother's birthday. It is a major PITA. Anyhow, I wanted a regex for parsing the HTTP response status line, with the sub-elements properly captured. I got this working:
const boost::regex status_line("HTTP/(\\d+?)\\.(\\d+?) (\\d+?) (.*)\r\n");
std::string status_test1("HTTP/1.1 200 hassan ali\r\n");
boost::smatch what;
std::cout << boost::regex_match(status_test1, what, status_line, boost::match_extra) << std::endl;
std::cout << what.size() << std::endl;
BOOST_FOREACH(std::string s, what)
{
    std::cout << s << std::endl;
}
The 4th capture group is what I was fussing about, particularly tokenising the words. But I don't need it, so my job is done. However, I'd still like to know how to tokenise a space-separated sentence that ends with a '\0' into a vector/array of stripped words.
I can't get the following fragment to work:
const boost::regex sentence_re("(.+?)( (.+?))*");
boost::smatch sentence_what;
std::string sentence("hassan ali syed ");
std::cout << boost::regex_match(sentence, sentence_what, sentence_re, boost::match_extra) << std::endl;
BOOST_FOREACH(std::string s, sentence_what)
{
    std::cout << s << std::endl;
}
It shouldn't match "hassan ali syed ", but it should match "hassan ali syed", and the capture groups should output hassan, ali, syed (with newlines), but it outputs hassan, syed, syed (note the space in the third, syed<space>syed). I suppose capture groups can't deal with recursive entities?
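To pin down what the sentence pattern above actually captures, here is a minimal sketch. It uses std::regex (ECMAScript syntax) rather than boost::regex so that it builds without Boost. That swap is an assumption: I'd expect Boost's default Perl syntax to behave the same way for this pattern, but the sketch doesn't prove it. The helper name `captures_of` is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Returns every sub-match (index 0 is the whole match), or an empty
// vector when the pattern does not match the entire input.
// NOTE: std::regex stands in for boost::regex here (an assumption).
std::vector<std::string> captures_of(const std::string& input)
{
    static const std::regex sentence_re("(.+?)( (.+?))*");
    std::smatch what;
    std::vector<std::string> out;
    if (std::regex_match(input, what, sentence_re)) {
        for (const auto& sub : what) {
            out.push_back(sub.str());
        }
    }
    return out;
}
```

Calling `captures_of("hassan ali syed")` gives four entries: the whole match, then `hassan`, ` syed` (with a leading space) and `syed`. The starred group is entered twice, but each pass overwrites the previous one, so only the last iteration is still there once the match finishes. A fixed number of groups can't hand back a variable number of tokens this way.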
So, is there a clean way to specify a tokenising task in PCRE syntax that results in a clean token vector (without repetition, i.e., I don't want the nested group to try and strip the whitespace)?
I know this isn't the right tool for the job (Spirit, lex or boost::tokenizer is best), and I know it isn't the right way to go about it. In .NET, when doing screen scraping, I'd find tokens in bodies of text by repeatedly applying a regular expression to the body until it ran out of tokens.
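The repeated-application approach I used in .NET would look roughly like this in C++. This is a sketch under assumptions, not a verified Boost solution: it uses std::regex and std::sregex_iterator so it compiles without Boost, and the function name `tokenise` and the pattern `\S+` (one run of non-whitespace per token) are my own choices. Boost has an iterator of the same name (boost::sregex_iterator), so I'd expect the port to be mostly a namespace change, but I haven't checked that here.

```cpp
#include <regex>
#include <string>
#include <vector>

// Apply the word pattern over and over until the input runs out of
// matches; each match is one token with the surrounding spaces gone.
// NOTE: `tokenise` and the "\\S+" pattern are illustrative choices.
std::vector<std::string> tokenise(const std::string& sentence)
{
    static const std::regex word_re("\\S+");
    std::vector<std::string> tokens;
    for (std::sregex_iterator it(sentence.begin(), sentence.end(), word_re), end;
         it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

With this, `tokenise("hassan ali syed ")` yields the three words with no leading or trailing spaces, and the trailing space in the input is simply skipped. The pattern describes a single token rather than the whole sentence, so no nested group is needed.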