I want to turn a std::string such as:

"{1, 2}, {one, two}, {123, onetwothree}"

Into a std::vector of std::pairs of std::strings which would look something like:

std::vector<std::pair<std::string, std::string>> v = {{"1", "2"}, {"one", "two"}, {"123", "onetwothree"}};
// where, for instance
v[0] == std::make_pair("1", "2"); // etc.

This seems like a case where the original std::string could be parsed most easily using std::regex, but I am NOT a regex expert (or novice), let alone a std::regex expert. Any ideas for a recipe here?

DiB
  • And you have tried what so far? – ScarletAmaranth Nov 27 '13 at 15:52
  • What about `"{1, {2, 3} { {a, b} } }"`? – Kerrek SB Nov 27 '13 at 15:55
  • Right now, I am using std::string's own find methods to do this manually. Not being a regex person, this did seem like a good case to get started, but examples I've found tend to return contents of a single brace match, but not iterate through multiple matches of brace matches. – DiB Nov 27 '13 at 16:07
  • As for the format, it is pretty strict with "{first, second}, {third, fourth}, etc." It's a list of pairs of strings, if that makes it more clear. – DiB Nov 27 '13 at 16:08
  • The example I found at http://stackoverflow.com/questions/13227802/c-regex-match-content-inside-curly-braces is on the right track, but far too trivial. I just don't know how to expand it to this more complex case. I'd be fine with some hybrid string/regex solution if that is best. – DiB Nov 27 '13 at 16:13
  • I'm not sure about C++, but generally regex libraries can't parse such things. You can usually extract only a constant number of substrings. – zch Nov 27 '13 at 16:18
  • This may help: http://www.cplusplus.com/reference/regex/regex_search/ – Zac Howland Nov 27 '13 at 16:21
  • I don't know if you even want to parse the entire string in a single regex. It may be possible using some of the more arcane features of PCRE, but it essentially just pushes your program's complexity from the code into the regex, and that's not necessarily a good thing. Parsing one pair of braces at a time and looping over it would probably make for much cleaner code. – Tim Pierce Nov 27 '13 at 16:24
  • Ok. That makes sense. I was hoping to buy some stability by shifting the parsing burden into the standard library, but perhaps my more iterative current solution is best? Thanks for the fresh view! – DiB Nov 27 '13 at 16:28
  • `\{([^{},]+),\s*([^{},]+)\}` is likely the regex you're looking for; you can do what qwrrty said for the structure. Be sure to double your backslashes when you go to string form. – FrankieTheKneeMan Nov 27 '13 at 16:29

2 Answers

Currently, <regex> doesn't work well with GCC, so here is a Boost version. Link it with -lboost_regex.

Boost's repeated-capture support would fit this case, but it is not enabled by default.

Here is the original post: Boost C++ regex - how to get multiple matches

#include <iostream>
#include <string>
#include <boost/regex.hpp>

using namespace std;

int main()
{
  string str = "{1, 2}, {one, two}, {123, onetwothree}";

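  // One brace-delimited pair, e.g. "{one, two}"
  boost::regex pair_pat("\\{[^{}]+\\}");
  // One element inside a pair; note the match keeps any surrounding whitespace
  boost::regex elem_pat("\\s*[^,{}]+\\s*");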

  boost::sregex_token_iterator end;

  for(boost::sregex_token_iterator iter(str.begin(), str.end(), pair_pat, 0);
      iter != end; ++iter) {

    string pair_str = *iter;
    cout << pair_str << endl;

    for (boost::sregex_token_iterator it(pair_str.begin(), pair_str.end(), elem_pat, 0);
         it != end; ++it)
      cout << *it << endl;
  }

  return 0;
}
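
The loop above only prints the pieces it finds. As a rough sketch of how the pairs could be collected into the vector from the question, here is a variant that swaps in the standard library's std::sregex_token_iterator for the Boost one and asks for submatches 1 and 2 of each match. It assumes a toolchain whose <regex> is usable (for GCC, 4.9 or later) and that elements contain no whitespace, commas, or braces. The name parse_pairs is hypothetical and not part of the answer.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (illustration only): collects "{a, b}" groups into pairs.
std::vector<std::pair<std::string, std::string>> parse_pairs(const std::string& s)
{
    // One capture group per element; whitespace around elements stays outside the groups.
    // Assumption: elements contain no whitespace, commas, or braces.
    static const std::regex pair_pat(R"(\{\s*([^{},\s]+)\s*,\s*([^{},\s]+)\s*\})");

    std::vector<std::pair<std::string, std::string>> out;
    const std::sregex_token_iterator end;

    // Requesting submatches {1, 2} yields group 1 and then group 2 for each match in turn,
    // so tokens always arrive two at a time.
    std::sregex_token_iterator it(s.begin(), s.end(), pair_pat, {1, 2});
    while (it != end) {
        std::string first = *it;
        ++it;
        std::string second = *it;
        ++it;
        out.emplace_back(std::move(first), std::move(second));
    }
    return out;
}
```

Calling parse_pairs on the string from the question should give three pairs, with no surrounding whitespace left on the elements because the whitespace is matched outside the capture groups.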
gongzhitaao

The match pattern is pretty simple: "\{\s*(\w+)\s*\,\s*(\w+)\s*\}", so we just need to loop through the string and collect every match. C++11 makes this pretty straightforward. Give this a shot:

// needs <regex>, <string>, <utility> and <vector>
std::string str = "{1, 2}, {one, two}, {123, onetwothree}";
std::vector<std::pair<std::string, std::string>> pairs;
std::regex exp(R"(\{\s*(\w+)\s*\,\s*(\w+)\s*\})");
std::smatch sm;
std::string::const_iterator cit = str.cbegin();
while (std::regex_search(cit, str.cend(), sm, exp)) {
    if (sm.size() == 3) // 3 = match, first item, second item
        pairs.emplace_back(sm[1].str(), sm[2].str());
    // the next line is a bit cryptic, but it just puts cit at the remaining string start
    cit = sm[0].second;
}

EDIT: Explanation of how it works: regex_search finds one match at a time, and a const iterator marks where the remaining input starts after each match:

{1, 2}, {one, two}, {123, onetwothree}
^ iterator cit
-- regex_search matches "{1, 2}" sm[1] == "1", sm[2] == "2"

{1, 2}, {one, two}, {123, onetwothree}
      ^ iterator cit
-- regex_search matches "{one, two}" sm[1] == "one", sm[2] == "two"

{1, 2}, {one, two}, {123, onetwothree}
                  ^ iterator cit
-- regex_search matches "{123, onetwothree}" sm[1] == "123", sm[2] == "onetwothree"

{1, 2}, {one, two}, {123, onetwothree}
                                      ^ iterator cit
-- regex_search returns false, no match
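
The same walk can also be written with std::sregex_iterator, which moves past each match on its own instead of repositioning cit by hand. This is only a sketch under the same assumptions as the snippet above (the pattern from this answer, elements made of \w characters). The function name collect_pairs is hypothetical.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (illustration only): same pattern as above, iterator-driven.
std::vector<std::pair<std::string, std::string>> collect_pairs(const std::string& str)
{
    static const std::regex exp(R"(\{\s*(\w+)\s*,\s*(\w+)\s*\})");

    std::vector<std::pair<std::string, std::string>> pairs;
    // A default-constructed sregex_iterator is the end-of-sequence marker.
    for (std::sregex_iterator it(str.begin(), str.end(), exp), end; it != end; ++it) {
        const std::smatch& sm = *it;  // sm[0] is the whole match, sm[1] and sm[2] the elements
        pairs.emplace_back(sm[1].str(), sm[2].str());
    }
    return pairs;
}
```

Each increment of the iterator runs the next search starting where the previous match ended, which is the step the manual loop performs with cit = sm[0].second.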
Sam Cristall
  • I put this in and it worked exactly like I wanted. I guess my biggest question here is in understanding how the regex knows to look for N matches. I was playing with other ideas based on searches and others' comments, and I was having a hard time getting more than the outer set of {} matched. – DiB Nov 27 '13 at 16:44
  • AH! So the regex is just matching one at a time, but it's the string iterator that is marching down the string looking for any matches that follow. Thanks for the extra explanation there. (I should have looked at how the iterator was being used more closely!) – DiB Nov 27 '13 at 16:52
  • I added an explanation on how it works -- the regex is not matching N matches, but rather the first match. We then use an iterator to iterate over the remainder. Doing this in a loop gets us every match. – Sam Cristall Nov 27 '13 at 16:52
  • sounds a lot like what [std::regex_token_iterator](http://en.cppreference.com/w/cpp/regex/regex_token_iterator) does – Cubbi Nov 27 '13 at 17:30