Tokenising text using boost regex

Question

I forget regular expressions faster than my mother's birthday. It is a major PITA. Anyhow, I wanted a regex for parsing the HTTP response status line with the sub-elements properly captured. I got this working:

  const boost::regex status_line("HTTP/(\\d+?)\\.(\\d+?) (\\d+?) (.*)\r\n");
  std::string status_test1("HTTP/1.1 200 hassan ali\r\n");

  boost::smatch what;
  std::cout << regex_match(status_test1,what, status_line, boost::match_extra) << std::endl;
  std::cout << what.size() << std::endl;

  BOOST_FOREACH(std::string s, what)
  {
    std::cout << s << std::endl;
  }

The 4th capture group is what I was fussing about, particularly tokenising the words. But I don't need it, so my job is done. However, I'd still like to know how to tokenise a space-separated sentence that ends with a '\0' into a vector/array of stripped words.

I can't get the following fragment to work:

  const boost::regex sentence_re("(.+?)( (.+?))*");
  boost::smatch sentence_what;
  std::string sentence("hassan ali syed ");

  std::cout << boost::regex_match(sentence,sentence_what,sentence_re, boost::match_extra) << std::endl;

  BOOST_FOREACH(std::string s, sentence_what)
  {
    std::cout << s << std::endl;
  }

It shouldn't match "hassan ali syed ", but it should match "hassan ali syed", and the capture groups should output hassan ali syed (one word per line). Instead it outputs hassan syed syed (note the space: "syed<space>syed"). I suppose capture groups can't deal with recursive entities?

So, is there a clean way of specifying a tokenising task in PCRE syntax that results in a clean token vector (without repetition, i.e., I don't want the nested group to try and strip the whitespace)?

I know this isn't the right tool for the job; Spirit / lex or boost::tokenizer would be better, and I know this isn't the right way to go about it. In .NET, when doing screen scraping, I'd find tokens in bodies of text by repeatedly applying a regular expression to the body until it ran out of tokens.

score 1 · Answer 1 · edited May 23 '17 at 10:34

This reminds me of a similar question, Capturing repeating subpatterns in Python regex.

If the number of space-separated words is limited to some maximum number of tokens, then you can just tack on a whole bunch of optional subpatterns, one per possible word, somewhat like:

"HTTP/(\\d+?)\\.(\\d+?) (\\d+?)(?: ([^ \r\n]+))?(?: ([^ \r\n]+))?(?: ([^ \r\n]+))?(?: ([^ \r\n]+))?\r\n"

Which is, of course, horrible.

If you wanted a nested group, I don't think this can be done without "repeated subpattern" support in your regexp implementation (see Python's nonstandard regex module as used in the linked question.) You are almost certainly better off doing this with elementary string functions - your local equivalent of string.split().
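For C++, the string.split() route could look roughly like the sketch below. It assumes the words are separated by whitespace and uses only the standard library; the function name split_words is made up for illustration.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a sentence on runs of whitespace.
// Leading and trailing whitespace is ignored, so no empty tokens are produced.
std::vector<std::string> split_words(const std::string& sentence)
{
    std::istringstream in(sentence);
    std::vector<std::string> words;
    std::string word;
    while (in >> word)            // operator>> skips the whitespace between words
        words.push_back(word);
    return words;
}
```

Calling split_words("hassan ali syed ") gives the three words without the trailing space, which is the "clean token vector" the question asks for, just without a regex.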

Note: I am not a C++ or Boost::regexp user. – Li-aung Yip, Mar 27 '12 at 15:21

score 0 · Answer 2 · answered Mar 27 '12 at 21:50

Boost might be able to do recursive groupings; I'm not sure, but I lean towards it can't.
.NET is the only engine I know of that can.

You can design a single regex with two parts. The first part captures the specific groups; the second captures all the rest in a single group. You can then apply a second regex repeatedly to the text captured by the second part.

Something like this:
(specific)(part)(to)(capture)(all the remaining text)

Then run a while( /(part)/ ) search loop over the remaining text captured in the previous step.

Here's how you could do it in Boost:

const std::string status = "HTTP/1.1 200 hassan ali\r\n";

// Groups 1-3: major version, minor version, status code. Group 4: the rest of the words.
boost::regex rx_sentence ( "HTTP/(\\d+)\\.(\\d+)\\s+(\\d+)\\s*([^\\s]+(?:\\s+[^\\s]+)*)?.*" );
// A token is any run of non-whitespace characters.
boost::regex rx_token ( "[^\\s]+" );
boost::smatch what;

if ( boost::regex_match( status, what, rx_sentence) )
{
    std::cout << "\nMatched:\n-----------------\n" << "'" << what[0] << "'" << std::endl;

    std::cout << "\nStatus (match groups):\n-----------------" << std::endl;
    for (int i=1; i < 4; i++)
    {
        std::cout << i << " = '" << what[i] << "'" << std::endl;
    }
    std::cout << "\nTokens (search of group 4):\n-----------------" << std::endl;
    // Copy group 4 first, because the search loop below overwrites 'what'.
    const std::string token_str = what[4];

    std::string::const_iterator start = token_str.begin();
    std::string::const_iterator end   = token_str.end();

    // Repeatedly search for the next token, resuming just past the previous match.
    while ( boost::regex_search(start, end, what, rx_token) )
    {
        std::string token(what[0].first, what[0].second);
        std::cout << "'" << token << "'" << std::endl;
        start = what[0].second;
    }
}
else
    std::cout << "Didn't match" << std::endl;
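The manual search loop above can also be written with a regex iterator, which does the "resume after the last match" bookkeeping for you. The sketch below uses std::regex and std::sregex_iterator instead of Boost so that it needs no external library; that swap is my assumption, not part of the code above. Boost provides boost::sregex_iterator with the same usage pattern. The function name regex_tokens is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect every maximal run of non-whitespace
// characters in 'text'. std::regex is a stand-in for boost::regex here
// (an assumption, to keep the sketch dependency-free).
std::vector<std::string> regex_tokens(const std::string& text)
{
    static const std::regex token_re("\\S+");
    std::vector<std::string> tokens;
    // Each increment re-applies the regex starting where the last match ended;
    // a default-constructed iterator marks the end of the sequence.
    for (std::sregex_iterator it(text.begin(), text.end(), token_re), last;
         it != last; ++it)
    {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

Feeding it the text of group 4 (or the whole "hassan ali syed " sentence from the question) yields one entry per word, with no whitespace attached to any of them.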
