
I'd like to parse content like:

tag = value
tag2 = value2
tag3 = value3

with two relaxations: a value may span multiple lines, and comment lines preceding the next tag must not end up in the value. A tag starts at the beginning of a new line and does not start with the comment identifier '#'. So this:

tag = value
  value continuation
tag2 = value2
  value continuation2
# comment for tag3
tag3 = value3

should parse to the mapping:

tag : "value\nvalue continuation"
tag2 : "value2\nvalue continuation2"
tag3 : "value3"

How can I achieve this in a clean way? My current code for parsing one-line pairs looks something like this:

while( std::getline( istr, line ) )
{
  ++lineCount;
  if( line[0] == '#' )
    currentComment.push_back( line );
  else if( isspace( line[0]) || line[0] == '\0' )
    currentComment.clear( );
  else
  {
    auto tag = Utils::string::splitString( line, '=' );
    if( tag.size() != 2 || line[line.size() - 1] == '=')
    {
      std::cerr << "Wrong tag syntax in line #" << lineCount << std::endl;
      return nullptr;
    }
    tagLines.push_back( line );
    currentComment.clear( );
  } 
}

Note that I don't require the results to be stored in the container types currently used. I can switch to anything that fits better, as long as I get sets of (comment, tagname, value).

user1709708
  • Can you add a "line delimiter" like C++ has? If you can add something to actually mark the end of the expression like a `;` then you could use `getline()` and specify the `;` as the delimiter. – NathanOliver Aug 04 '15 at 11:48
  • @user1709708 Are your tags always one word? Can comments ever be placed in the middle of a pair like: "tag2 = value2\n# comment for tag3\n\tvalue continuation"? Are value continuations always indented and are pair starts never indented? – Jonathan Mee Aug 04 '15 at 12:27
  • Tags are always one word, comments in the middle of pairs are currently parsed as being part of the tag/value. So this doesn't need to be supported unless it's very easy. Yes, continuations are always indented whereas pairs aren't. That's how we differentiate the two at the moment. Again, unless there is a cleaner and easier solution I'm fine with leaving this as is. The line delimiter I consider as too big of an impact since this would influence all older files we've used so far. – user1709708 Aug 05 '15 at 09:40

1 Answer


Generally regexes add complexity to your code, but in this case a regex seems to be the best solution. A regex like this will capture the tag (group 1) and the value (group 2) of each pair, skipping any comment lines in front of the tag:

(?:\s*#.*\n)*(\w+)\s*=\s*((?:[^#=\n]+(?:\n|$))+)


std::regex_iterator requires bidirectional iterators, so in order to use one on an istream you'll need to either slurp the stream into a string or use boost::regex_iterator with the boost::match_partial flag. Say that the istream has been slurped into the string input. This code will extract the pairs:

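// Assumes <iostream>, <regex>, <string> and using namespace std; input holds the slurped stream.
const regex re(R"((?:\s*#.*\n)*(\w+)\s*=\s*((?:[^#=\n]+(?:\n|$))+))");

for (sregex_iterator i(input.cbegin(), input.cend(), re); i != sregex_iterator(); ++i) {
    const string tag = (*i)[1];    // tag name
    const string value = (*i)[2];  // value with its continuation lines, indentation and line breaks kept

    cout << tag << ':' << value << endl;
}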

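
To get from the raw captures to the mapping shown in the question, the value still needs its trailing newline and the continuation indentation removed. A minimal sketch of that, assuming the whole stream fits in memory; parseTags is a hypothetical helper name, and the std::map return type is just one possible container:

```cpp
#include <istream>
#include <iterator>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper (not from the question): parse a stream into tag -> value.
std::map<std::string, std::string> parseTags(std::istream& in) {
    // Slurp the stream, since std::sregex_iterator needs bidirectional iterators.
    const std::string input{std::istreambuf_iterator<char>(in),
                            std::istreambuf_iterator<char>()};

    // Same pattern as above: optional comment lines, tag, '=', one or more value lines.
    static const std::regex pairRe(
        R"((?:\s*#.*\n)*(\w+)\s*=\s*((?:[^#=\n]+(?:\n|$))+))");
    // A line break followed by the indentation of a continuation line.
    static const std::regex indentRe(R"(\n[ \t]*)");

    std::map<std::string, std::string> result;
    for (std::sregex_iterator it(input.cbegin(), input.cend(), pairRe), end;
         it != end; ++it) {
        std::string value = (*it)[2].str();
        // Drop trailing whitespace, including the final newline of the match.
        value.erase(value.find_last_not_of(" \t\r\n") + 1);
        // Remove the indentation in front of each continuation line.
        value = std::regex_replace(value, indentRe, "\n");
        result[(*it)[1].str()] = value;
    }
    return result;
}
```

Note that, because of the [^#=\n] class in the pattern, a value line containing '=' or '#' is not captured by this approach.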

This goes beyond what the question asked for: it parses out tags and values instead of just grabbing the lines. Much of the functionality used here (std::regex and its iterators) was only added in C++11, so if there are any questions please comment below.

Jonathan Mee