C++ routine for processing outer HTML

Question

I'm looking for some help writing a routine that returns an element's type, all of its classes, its id if there is one, and its child elements. For instance, if I have

<div class="someClass someOtherClass" id="someID"><p>Here's a line</p><p>Here's another line</p></div>

then I'm processing the string

"<div class=\"someClass someOtherClass\" id=\"someID\"><p>Here's a line</p><p>Here's another line</p></div>"

and want to get the element div, the classes someClass and someOtherClass, the id someID, and the two p elements that are its children. So the setup looks like

node * process_tag(const std::string & outerHTML) 
{
   node * retNode = new node;
   // ...
   return retNode;
}

where a node is defined by

struct node
{
    std::string element_type;
    std::vector<std::string> class_list;
    std::string iden;
    std::vector<node*> children;
};

Is there an easy way to do this or am I going to stay up late trying to figure it out tonight?

And you're not using an XML library because...? If you want to use a library, recommendations are off-topic on S.O.; if you don't, your question's too broad... there are lots of tiny details of HTML conventions that make robust parsing too hard to explain in one answer here. — Tony Delroy, Oct 29 '14 at 05:54
The easy way: use an HTML parser. Or even an XML parser if your HTML is well-enough behaved (though it usually isn't). Or pass it through `tidy` first, asking for XHTML output. Then you can use an XML parser library, like `expat`. That's my usual approach for C; if it were Python, the only answer would be BeautifulSoup. — rici, Oct 29 '14 at 05:58
A little out-dated, but [**a decent starting point**](http://stackoverflow.com/questions/9387610/what-xml-parser-should-i-use-in-c) — WhozCraig, Oct 29 '14 at 07:26
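
For reference, here is a minimal hand-rolled sketch of `process_tag`, only to show the shape of the problem for well-formed input like the example in the question. It assumes the string is the full outer HTML starting with the opening tag, that every element has an explicit closing tag, and that text content can be skipped. The helpers `parse_element` and `free_node` are hypothetical names, not part of the question. It does not handle void elements such as `<br>`, comments, entities, or mismatched tags, which is why the comments above recommend a real parser for anything beyond this.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct node
{
    std::string element_type;
    std::vector<std::string> class_list;
    std::string iden;
    std::vector<node*> children;
};

// Hypothetical helper: recursively frees a tree built by process_tag.
void free_node(node* n)
{
    if (!n) return;
    for (node* child : n->children) free_node(child);
    delete n;
}

static bool is_space(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: parses one element starting at s[pos] == '<' and
// leaves pos just past the closing tag. Assumes well-formed markup.
node* parse_element(const std::string& s, std::size_t& pos)
{
    if (pos >= s.size() || s[pos] != '<') return nullptr;
    ++pos;  // skip '<'
    node* n = new node;

    // Tag name: everything up to whitespace, '/' or '>'.
    while (pos < s.size() && !is_space(s[pos]) && s[pos] != '/' && s[pos] != '>')
        n->element_type += s[pos++];

    // Attributes: name, optionally followed by =value (quoted or bare).
    while (pos < s.size() && s[pos] != '>') {
        if (is_space(s[pos]) || s[pos] == '/') { ++pos; continue; }
        std::string name, value;
        while (pos < s.size() && !is_space(s[pos]) && s[pos] != '=' &&
               s[pos] != '/' && s[pos] != '>')
            name += s[pos++];
        if (pos < s.size() && s[pos] == '=') {
            ++pos;  // skip '='
            if (pos < s.size() && (s[pos] == '"' || s[pos] == '\'')) {
                const char quote = s[pos++];
                while (pos < s.size() && s[pos] != quote) value += s[pos++];
                if (pos < s.size()) ++pos;  // skip closing quote
            } else {
                while (pos < s.size() && !is_space(s[pos]) && s[pos] != '>')
                    value += s[pos++];
            }
        }
        if (name == "id") {
            n->iden = value;
        } else if (name == "class") {
            // Class names are separated by whitespace.
            std::istringstream classes(value);
            std::string cls;
            while (classes >> cls) n->class_list.push_back(cls);
        }
    }
    if (pos < s.size()) ++pos;  // skip '>'

    // Children: recurse on each opening tag; any closing tag ends this element.
    while (pos < s.size()) {
        if (s[pos] != '<') { ++pos; continue; }  // text content is skipped
        if (pos + 1 < s.size() && s[pos + 1] == '/') {
            const std::size_t end = s.find('>', pos);
            pos = (end == std::string::npos) ? s.size() : end + 1;
            break;
        }
        if (node* child = parse_element(s, pos)) n->children.push_back(child);
    }
    return n;
}

// Returns the root element of outerHTML, or nullptr if there is no tag.
// The caller owns the result and releases it with free_node.
node* process_tag(const std::string& outerHTML)
{
    std::size_t pos = outerHTML.find('<');
    if (pos == std::string::npos) return nullptr;
    return parse_element(outerHTML, pos);
}
```

Calling `process_tag` on the example string should give a node with `element_type` set to `div`, two entries in `class_list`, `iden` set to `someID`, and two `p` children. The raw `node*` ownership mirrors the struct in the question; switching `children` to `std::vector<std::unique_ptr<node>>` would remove the need for `free_node`.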
