What design pattern should I use for a function that parses HTML attributes? Is this a job for Regex?

Question

I'm wondering if you guys can help me start this out. I have a function that is defined as follows:

bool HtmlProcessor::_hasNextAttribute(std::string::iterator & it1, const std::string::iterator & it2, const std::pair<std::string, std::string> attrHolder)
{
      /* Parses the first HTML attribute in the iterator range [it1, it2), storing it in attrHolder; e.g.

         "class="myClass1 myClass2" id="myId" onsubmit = "myFunction()""

         ----------  _hasNextAttribute  -------->

         attrHolder = ("class", "myClass1 myClass2")

         When the function terminates, it1 will be the iterator to the last character parsed, or equal to
         it2 if no characters were parsed.

      */

}

In other words, it looks for the first pattern of

[someString][possibleWhiteSpace]=[possibleWhiteSpace][quotationMark][someOtherString][quotationMark]

and puts that in a pair (someString, someOtherString).

What sort of algorithm should I be using to do this elegantly?

Bonus question:

Where I use the function,

while (_hasNextAttribute(it1, it2, thisAttribute))

I am getting a compiler error

Non-const lvalue reference to type '__wrap_iter<pointer>' cannot bind to a value of unrelated type '__wrap_iter<const_pointer>'

Any idea why that might be?

Are you trying to summon [Zalgo](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)? — user657267, Nov 05 '14 at 04:06

score 0 · Answer 1 · answered Nov 05 '14 at 16:14

Regular expressions can be useful to parse well-structured input. When taking input from users, I find it more flexible to use my custom reading functions.
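For comparison, here is roughly what the regex route could look like with std::regex. This is only a sketch under my own assumptions: the helper name nextAttributeRegex is made up, it takes const_iterators, and it accepts only alphanumeric attribute names and double-quoted values, as in your pattern.

```cpp
#include <regex>
#include <string>
#include <utility>

// Hypothetical helper (not from the question): tries to match one
// name="value" attribute at the start of [first, last). On success it
// fills attr and advances first past the match.
bool nextAttributeRegex(std::string::const_iterator &first,
                        std::string::const_iterator last,
                        std::pair<std::string, std::string> &attr)
{
    // \s* (name) \s* = \s* "(value)" \s*
    static const std::regex pattern(
        R"re(\s*([A-Za-z0-9]+)\s*=\s*"([^"]*)"\s*)re");
    std::smatch match;

    // match_continuous anchors the match at first instead of searching
    // further along the string for a later attribute.
    if (!std::regex_search(first, last, match, pattern,
                           std::regex_constants::match_continuous)) {
        return false;
    }

    attr.first = match[1].str();
    attr.second = match[2].str();
    first = match[0].second;        // one past the matched text
    return true;
}
```

It can be called in a loop just like the hand-written version, starting from str.cbegin() and passing str.cend() as the end. The price is that a failed match tells you nothing about where or why the input was malformed.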

The example below returns whether a valid attribute following your pattern was found. If so, the first iterator is advanced beyond that attribute and the name and value are stored in the pair. (The pair should be passed by non-const reference, so that changes are visible to the caller.) If not, the iterator stays where it was. If the iterator is not at the end of the string after all attributes have been read, not all of the input was parsed.

As is, the function emulates the behaviour of a specialised regular expression. (I've annotated the code with the sub-expressions it corresponds to.) But because you have complete control over the code, you can modify and extend it. For example, you could replace each occurrence of return false with an appropriate error code so you can generate good error messages.

Anyway, here goes:

#include <cctype>      // isspace, isalnum
#include <iostream>
#include <string>
#include <utility>     // std::pair

bool nextAttribute(std::string::iterator &iter, 
    const std::string::iterator &end, 
    std::pair<std::string, std::string> &attr)
{
    std::string::iterator it = iter;
    std::string::iterator start;

    // Cast to unsigned char: passing a negative char to isspace/isalnum is undefined.
    while (it != end && isspace(static_cast<unsigned char>(*it))) ++it;  // \s*
    if (it == end) return false;

    start = it;                                                          // (
    while (it != end && isalnum(static_cast<unsigned char>(*it))) ++it;  //   [A-Za-z0-9]+
    if (it == start) return false;
    attr.first = std::string(start, it);                                 // )

    while (it != end && isspace(static_cast<unsigned char>(*it))) ++it;  // \s*
    if (it == end) return false;
    if (*it != '=') return false;                                        // =
    ++it;

    while (it != end && isspace(static_cast<unsigned char>(*it))) ++it;  // \s*
    if (it == end) return false;
    if (*it != '"') return false;                                        // "
    ++it;

    start = it;                                                          // (
    while (it != end && *it != '"') ++it;                                //   [^"]*
    if (it == end) return false;
    attr.second = std::string(start, it);                                // )
    ++it;

    while (it != end && isspace(static_cast<unsigned char>(*it))) ++it;  // \s*
    iter = it;   

    return true;
}



int main()
{   
    std::string str("class=\"big red\" id=\"007\" onsubmit = \"go()\"");
    std::pair<std::string, std::string> attr;
    std::string::iterator it = str.begin();

    while (nextAttribute(it, str.end(), attr)) {
        std::cout << attr.first << ": '" << attr.second << "'\n";
    }

    if (it != str.end()) {
        std::cout << "Incomplete: " 
            << std::string(it, str.end()) << "\n";
    }

    return 0;
}
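
To illustrate the error-code idea mentioned above, here is one possible variant. It is only a sketch: the enum AttrResult, its enumerator names, the helper skipSpace and the function name nextAttributeChecked are all invented for this example, and you would probably want different categories in real code.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical result codes; the names are illustrative only.
enum class AttrResult {
    Ok,
    EndOfInput,         // nothing but whitespace left
    MissingName,
    MissingEquals,
    MissingQuote,
    UnterminatedValue
};

// Hypothetical helper: advances it past any whitespace.
static void skipSpace(std::string::iterator &it,
                      const std::string::iterator &end)
{
    while (it != end && std::isspace(static_cast<unsigned char>(*it))) ++it;
}

AttrResult nextAttributeChecked(std::string::iterator &iter,
                                const std::string::iterator &end,
                                std::pair<std::string, std::string> &attr)
{
    std::string::iterator it = iter;

    skipSpace(it, end);
    if (it == end) return AttrResult::EndOfInput;

    std::string::iterator start = it;
    while (it != end && std::isalnum(static_cast<unsigned char>(*it))) ++it;
    if (it == start) return AttrResult::MissingName;
    std::string name(start, it);

    skipSpace(it, end);
    if (it == end || *it != '=') return AttrResult::MissingEquals;
    ++it;

    skipSpace(it, end);
    if (it == end || *it != '"') return AttrResult::MissingQuote;
    ++it;

    start = it;
    while (it != end && *it != '"') ++it;
    if (it == end) return AttrResult::UnterminatedValue;
    std::string value(start, it);
    ++it;

    skipSpace(it, end);

    // Only touch the outputs once the whole attribute has been parsed.
    attr.first = name;
    attr.second = value;
    iter = it;
    return AttrResult::Ok;
}
```

The calling loop then becomes while (nextAttributeChecked(it, str.end(), attr) == AttrResult::Ok), and the first result that is neither Ok nor EndOfInput tells you what kind of message to print.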

score 0 · Answer 2 · answered Nov 05 '14 at 16:29

I'd suggest a top-down approach:

1. Locate the first = character, which separates the attribute name from the attribute value.
2. Locate the first non-whitespace character preceding the = character.
3. Locate the first " character following the =.
4. Locate the second " character, following the first ".

The attribute name is everything from the beginning up to the non-whitespace character you found in step 2. The attribute value is everything between the two quotation marks you found in steps 3 and 4.

That being said, I'd not recommend dealing with iterators into std::string objects: most of the std::string search API is built around indices, e.g. std::string::find_last_not_of (which is useful for implementing step 2 above) takes an index to start from and returns an index.
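
The four steps could be sketched with indices roughly as follows. The function name nextAttributeByIndex, the pos parameter and the whitespace set are assumptions made for this example. The check that only whitespace sits between the = and the opening quote is an addition that goes beyond the steps as listed.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: parses one attribute starting at index pos.
// On success it fills attr and moves pos just past the closing quote.
bool nextAttributeByIndex(const std::string &s,
                          std::string::size_type &pos,
                          std::pair<std::string, std::string> &attr)
{
    const char *ws = " \t\r\n\f\v";
    const std::string::size_type npos = std::string::npos;

    // Skip leading whitespace to find where the name begins.
    const std::string::size_type nameBegin = s.find_first_not_of(ws, pos);
    if (nameBegin == npos) return false;

    // Step 1: the first '=' after the start of the name.
    const std::string::size_type eq = s.find('=', nameBegin);
    if (eq == npos || eq == nameBegin) return false;

    // Step 2: the last non-whitespace character before the '='.
    // s[nameBegin] is non-whitespace, so this cannot be before nameBegin.
    const std::string::size_type nameLast = s.find_last_not_of(ws, eq - 1);

    // Step 3: the opening quote after the '='.
    const std::string::size_type open = s.find('"', eq + 1);
    if (open == npos) return false;
    // Extra check (not in the steps): only whitespace between '=' and '"'.
    if (s.find_first_not_of(ws, eq + 1) != open) return false;

    // Step 4: the closing quote.
    const std::string::size_type close = s.find('"', open + 1);
    if (close == npos) return false;

    attr.first = s.substr(nameBegin, nameLast - nameBegin + 1);
    attr.second = s.substr(open + 1, close - open - 1);
    pos = close + 1;
    return true;
}
```

Start with pos = 0 and call it in a loop until it returns false. Note that this sketch does not check that the name itself is free of whitespace or other odd characters, so add that if you need strict validation.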
