
I'm reading the documentation on std::regex_iterator<std::string::iterator> since I'm trying to learn how to use it for parsing HTML tags. The example the site gives is

#include <iostream>
#include <string>
#include <regex>

int main ()
{
  std::string s ("this subject has a submarine as a subsequence");
  std::regex e ("\\b(sub)([^ ]*)");   // matches words beginning by "sub"

  std::regex_iterator<std::string::iterator> rit ( s.begin(), s.end(), e );
  std::regex_iterator<std::string::iterator> rend;

  while (rit!=rend) {
    std::cout << rit->str() << std::endl;
    ++rit;
  }

  return 0;
}

(http://www.cplusplus.com/reference/regex/regex_iterator/regex_iterator/)

and I have one question about it: if rend is never initialized, how can it be used meaningfully in the comparison rit != rend?

Also, is this the tool I should be using to get attributes out of HTML tags? What I want to do is take a string like "class='class1 class2' id = 'myId' onclick ='myFunction()' >" and break it into pairs

("class", "class1 class2"), ("id", "myId"), ("onclick", "myFunction()")

and then work with them from there. The regular expression I'm planning to use is

([A-Za-z0-9\\-]+)\\s*=\\s*(['\"])(.*?)\\2

and so I plan to iterate through the matches of that expression while keeping track of whether I'm still inside the tag (i.e. whether I've passed a '>' character). Is this going to be too hard to do?
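For reference, the plan above could be sketched roughly as follows. This is only an illustration under stated assumptions: parse_attributes is a hypothetical helper name, the pattern is the one from the question written as a raw string literal, and the code assumes that no attribute value contains a '>' character. It is not a general HTML parser.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: collects (name, value) pairs from the attribute
// portion of a tag, looking only at the text before the first '>'.
std::vector<std::pair<std::string, std::string>>
parse_attributes(const std::string& tag_text)
{
    // The pattern from the question, as a raw string literal so the
    // backslashes do not need to be doubled.
    static const std::regex attr(R"re(([A-Za-z0-9\-]+)\s*=\s*(['"])(.*?)\2)re");

    // Stop at the closing '>' (assumption: no '>' inside a quoted value).
    const std::size_t close = tag_text.find('>');
    const std::string::const_iterator last =
        (close == std::string::npos)
            ? tag_text.cend()
            : tag_text.cbegin() + static_cast<std::ptrdiff_t>(close);

    std::vector<std::pair<std::string, std::string>> result;
    // `end` is default-constructed, i.e. the end-of-sequence iterator.
    for (std::sregex_iterator it(tag_text.cbegin(), last, attr), end;
         it != end; ++it) {
        // Group 1 is the attribute name, group 3 the text between the quotes.
        result.emplace_back((*it)[1].str(), (*it)[3].str());
    }
    return result;
}
```

Calling parse_attributes on the example string from the question should give the three pairs listed there. Limiting the search range to the text before '>' replaces the manual "am I still in the tag" bookkeeping.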

Thank you for any guidance you can offer me.

2 Answers


What do you mean by "if rend is never initialized"? Clearly, std::regex_iterator<I> has a default constructor. Since the iteration is forward-only, the end iterator just needs to be something suitable for detecting that the end has been reached. The default constructor can set up rend accordingly.

This is an idiom used in a few other places in the standard C++ library, e.g., for std::istream_iterator<T>. Ideally, the end iterator could be indicated using a different type (see, e.g., Eric Niebler's discussion of this issue; the link is to the first of four pages), but the standard currently requires that the two types match when using algorithms.
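As a small illustration of the same idiom with std::istream_iterator<T>, here is a minimal sketch. The helper name sum_ints is made up for this example; the default-constructed iterator plays the role of the end-of-stream marker, and both iterators have the same type so they can be passed to an algorithm.

```cpp
#include <iterator>
#include <numeric>
#include <sstream>
#include <string>

// Hypothetical helper: sums the whitespace-separated integers in `text`.
int sum_ints(const std::string& text)
{
    std::istringstream in(text);
    std::istream_iterator<int> first(in);  // reads from the stream
    std::istream_iterator<int> last;       // default-constructed: end-of-stream
    // Both iterators have the same type, as the algorithm requires.
    return std::accumulate(first, last, 0);
}
```

For example, sum_ints("1 2 3") adds up the three values, and an empty string yields 0 because first compares equal to last immediately.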

With respect to parsing HTML using regular expression please refer to this answer.

Dietmar Kühl

rend is not uninitialized, it is default-constructed. The page you linked is clear that:

The default constructor (1) constructs an end-of-sequence iterator.

Since default-construction appears to be the only way to obtain an end-of-sequence iterator, comparing rit to rend is the correct way to test whether rit is exhausted.
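To make that concrete, here is a minimal sketch of the loop from the question wrapped in a function. The name all_matches is hypothetical, and std::sregex_iterator is the standard alias for std::regex_iterator<std::string::const_iterator>.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: returns every match of `e` in `s` as a string.
std::vector<std::string> all_matches(const std::string& s, const std::regex& e)
{
    std::vector<std::string> out;
    const std::sregex_iterator end;  // default-constructed: end-of-sequence
    for (std::sregex_iterator it(s.begin(), s.end(), e); it != end; ++it) {
        out.push_back(it->str());    // text of the whole match
    }
    return out;
}
```

With the string and pattern from the question, this should collect "subject", "submarine" and "subsequence". Once the last match has been consumed, incrementing it makes it compare equal to end, which is what terminates the loop.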

user4815162342