
I am using VC++ 10 in a project. Being new to C/C++, I just Googled, and it appears that standard C++ doesn't have regex? VC++ 10 seems to have regex. However, how do I do a regex split? Do I need Boost just for that?

Searching the web, I found that many recommend Boost for many things: tokenizing/splitting strings, parsing (PEG), and now even regex (though this should be built in ...). Can I conclude Boost is a must-have? It's 180MB for just trivial things that are supported natively in many languages?

Jiew Meng
  • Standard C++ **does** have a [regex library](http://en.cppreference.com/w/cpp/regex). – Mankarse Oct 24 '12 at 12:27
  • Oh, I see, but it's C++11; which compilers support that? Does VC++, which I am supposed to use for this school assignment, support it (properly)? – Jiew Meng Oct 24 '12 at 12:30
  • That said, I would have to agree that Boost is a must-have. If you do not use it where appropriate, you will just end up either reimplementing it or using an inferior solution. VC++10 supports the [tr1](http://en.wikipedia.org/wiki/C%2B%2B_Technical_Report_1) version of the regex library (see [here](http://msdn.microsoft.com/en-us/library/bb982382(v=vs.100).aspx)), which (afaik) is almost identical to the C++11 version. As for C++11, the latest versions of every major compiler support all of the main features, but it is not uniformly available (especially when targeting old platforms). – Mankarse Oct 24 '12 at 12:38
  • Before you get all warm-and-fuzzy about using regex with Visual Studio 2010, you may want to consider [this known defect](http://connect.microsoft.com/VisualStudio/feedback/details/648543/tr1-regex-doesnt-match-a-valid-pattern-with-repetition) that MS essentially refused to fix in 2010, even in a patch. Apparently it is fixed in the who-knew-there-was-a-2011 release, and I can only speculate it is fixed in 2012+. The defect deals with the repetition counts following patterns, and there are few things MS has done (and not done) with the std lib that pissed me off more than this. – WhozCraig Oct 24 '12 at 12:56
  • What do you mean by "split"? Class std::string has many methods. Maybe you don't need regex in the first place. – SChepurin Oct 24 '12 at 13:00
  • @SChepurin, meaning `split("aaa bbb ccc", " ")` should return something like `["aaa", "bbb", "ccc"]` – Jiew Meng Oct 24 '12 at 13:28
  • @Jiew Meng - Just as I thought. See http://stackoverflow.com/questions/236129/splitting-a-string-in-c. A search will turn up more ways to choose from. – SChepurin Oct 24 '12 at 14:01
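
For the plain `split("aaa bbb ccc", " ")` case described in the comments, a regex may not be needed at all. The following is only a minimal sketch using `std::string::find`; the name `split_plain` is hypothetical (it is not from the thread or the linked question), and it assumes a single-character delimiter.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split `text` on a single-character delimiter.
// Uses only std::string members, so it should also build on pre-C++11 compilers.
std::vector<std::string> split_plain(const std::string& text, char delimiter)
{
    std::vector<std::string> parts;
    std::string::size_type start = 0;
    std::string::size_type pos;
    // Copy out each stretch of text that ends at a delimiter.
    while ((pos = text.find(delimiter, start)) != std::string::npos) {
        parts.push_back(text.substr(start, pos - start));
        start = pos + 1;
    }
    // Whatever follows the last delimiter (possibly empty) is the final part.
    parts.push_back(text.substr(start));
    return parts;
}
```

Calling `split_plain("aaa bbb ccc", ' ')` gives the three parts `aaa`, `bbb` and `ccc`. Note that consecutive delimiters produce empty parts in this sketch.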

2 Answers


The C++11 standard has std::regex. It is also included in TR1 for Visual Studio 2010. In fact, TR1 has been available since VS2008, where it lives in the std::tr1 namespace. So you don't need Boost.Regex for VS2008 or later.

Splitting can be performed using regex_token_iterator:

#include <iostream>
#include <string>
#include <regex>

// On a C++11 compiler, drop the tr1:: qualifier and use std::regex directly.
int main()
{
   const std::string s("The-meaning-of-life-and-everything");
   const std::tr1::regex separator("-");
   const std::tr1::sregex_token_iterator endOfSequence;

   // Submatch index -1 selects the text between separator matches.
   std::tr1::sregex_token_iterator token(s.begin(), s.end(), separator, -1);
   while(token != endOfSequence)
   {
      std::cout << *token++ << std::endl;
   }
}
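
To get something closer to the `split(...)` call the question asks for, the same iterator can be wrapped in a function. This is a sketch only: `regex_split` is a hypothetical name, and it assumes a C++11 compiler, so it uses `std::regex` rather than `std::tr1::regex`.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not part of the standard library):
// returns the pieces of `text` that lie between matches of `separator`.
std::vector<std::string> regex_split(const std::string& text,
                                     const std::regex& separator)
{
    std::vector<std::string> parts;
    // Submatch index -1 means "the text between matches".
    std::sregex_token_iterator it(text.begin(), text.end(), separator, -1);
    const std::sregex_token_iterator end;
    for (; it != end; ++it) {
        parts.push_back(it->str());
    }
    return parts;
}
```

For example, `regex_split("aaa bbb ccc", std::regex(" "))` yields `aaa`, `bbb` and `ccc`.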

If you also need the separator itself, you can obtain it from the sub_match object that token points to; it is a pair holding the start and end iterators of the token.

while(token != endOfSequence) 
{
   const std::tr1::sregex_token_iterator::value_type& subMatch = *token;
   if(subMatch.first != s.begin())
   {
      const char sep = *(subMatch.first - 1);
      std::cout << "Separator: " << sep << std::endl;
   }

   std::cout << *token++ << std::endl;
}

This is a sample for the case where the separator is a single char. If the separator can be an arbitrary substring, you need more complex iterator work and possibly have to store the previous token's sub_match object.
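
For separators of arbitrary length, one alternative is to pass both `-1` and `0` as submatch indexes, so that the iterator alternates between the text before each match and the match itself. The sketch below assumes a C++11 compiler (`std::regex`); the `Piece` struct and the name `split_keep_separators` are hypothetical and exist only for illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical result type: a piece of text plus a flag saying
// whether it was matched by the separator regex.
struct Piece {
    std::string text;
    bool is_separator;
};

// Hypothetical helper: returns tokens and separators in document order.
std::vector<Piece> split_keep_separators(const std::string& text,
                                         const std::regex& separator)
{
    // -1 = text before each match, 0 = the whole match (the separator).
    const int fields[] = { -1, 0 };
    std::sregex_token_iterator it(text.begin(), text.end(), separator, fields);
    const std::sregex_token_iterator end;

    std::vector<Piece> pieces;
    bool is_separator = false;  // the sequence starts with a token, then alternates
    for (; it != end; ++it) {
        pieces.push_back(Piece{ it->str(), is_separator });
        is_separator = !is_separator;
    }
    return pieces;
}
```

With the pattern `[=+{};]` from the comments and the input `a=b+c`, this produces `a`, `=`, `b`, `+`, `c`, with the second and fourth pieces flagged as separators. A string that begins with a separator yields an empty first token.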

Or you can use regex groups, placing the separators in the first group and the real token in the second:

const std::string s("The-meaning-of-life-and-everything");
// [^-]+ (not [^-]*) so the regex cannot match the empty string at the end of s,
// which would otherwise add two empty tokens to the output.
const std::tr1::regex separatorAndStr("(-*)([^-]+)");
const std::tr1::sregex_token_iterator endOfSequence;

// Separators will be the 0th, 2nd, 4th... tokens
// Real tokens will be the 1st, 3rd, 5th... tokens
int subMatches[] = { 1, 2 };
std::tr1::sregex_token_iterator token(s.begin(), s.end(), separatorAndStr, subMatches);
while(token != endOfSequence)
{
   std::cout << *token++ << std::endl;
}

I am not sure this is 100% correct; it is only meant to illustrate the idea.

Rost
  • Thanks, just one missing part: can I get the string that was matched? E.g., if I am matching `[=+{};]`, I want to know which character was matched. – Jiew Meng Oct 24 '12 at 13:30
  • Great, but can you explain the `subMatches`? I don't get a better understanding from looking at the docs: http://msdn.microsoft.com/en-us/library/bb982699.aspx :( – Jiew Meng Oct 25 '12 at 13:43
  • `subMatches` is an array of the indexes of the regex groups that you want to iterate over. E.g. if you have the regex `(a)(b)(c)` and pass {2,3} to the `token_iterator` ctor, it will iterate only over the `(b)` and `(c)` groups, treating them as separate tokens. Matches of the `(a)` group will be skipped. – Rost Oct 25 '12 at 14:15
  • Also note that zero index corresponds to entire regex. – Rost Oct 25 '12 at 14:34
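
The `(a)(b)(c)` example from the comments can be sketched as follows. This assumes a C++11 compiler (`std::regex`), and `select_groups` is a hypothetical name used only for this illustration.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: for every match of (a)(b)(c) in `text`,
// emit groups 2 and 3 as separate tokens and skip group 1.
std::vector<std::string> select_groups(const std::string& text)
{
    const std::regex pattern("(a)(b)(c)");
    // Index 0 would be the whole match; 2 and 3 select (b) and (c).
    const int groups[] = { 2, 3 };
    std::sregex_token_iterator it(text.begin(), text.end(), pattern, groups);
    const std::sregex_token_iterator end;

    std::vector<std::string> tokens;
    for (; it != end; ++it) {
        tokens.push_back(it->str());
    }
    return tokens;
}
```

For the input `abcabc` there are two matches, so the tokens come out as `b`, `c`, `b`, `c`.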

Here is an example from this blog.

You'll have all your matches in `res`.

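const std::string str = "<h2>Egg prices</h2>";
const std::tr1::regex rx("<h(.)>([^<]+)");
std::tr1::cmatch res;
// res[0] is the whole match; res[1] and res[2] are the two capture groups.
if (std::tr1::regex_search(str.c_str(), res, rx))
   std::cout << res[1] << ". " << res[2] << "\n";   // prints "2. Egg prices"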
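
On a C++11 compiler the same search can be written with `std::smatch`, which works on `std::string` directly and avoids the `c_str()` call. This is a sketch under that assumption; `parse_heading` is a hypothetical name, and returning a pair of empty strings on failure is just one possible convention.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper: returns (heading level, heading text),
// or two empty strings if `html` contains no <hN> heading.
std::pair<std::string, std::string> parse_heading(const std::string& html)
{
    static const std::regex pattern("<h(.)>([^<]+)");
    std::smatch match;
    if (!std::regex_search(html, match, pattern)) {
        return std::make_pair(std::string(), std::string());
    }
    // match[1] is the level character, match[2] the text up to the next '<'.
    return std::make_pair(match[1].str(), match[2].str());
}
```

For `"<h2>Egg prices</h2>"` this returns the pair `("2", "Egg prices")`.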
vivek