I have a program which downloads HTML, specifically quarterly reports from SEC.gov, using libcurl's CURLOPT_WRITEFUNCTION callback to hold each report in memory.
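For context, a write callback of the kind CURLOPT_WRITEFUNCTION expects might look like the sketch below. The name `append_to_buffer` is hypothetical, and it assumes a `std::string*` was registered through CURLOPT_WRITEDATA; the actual program may differ.

```cpp
#include <cstddef>
#include <string>

// Same parameter list libcurl documents for a CURLOPT_WRITEFUNCTION callback.
// Assumption: userdata is the std::string* that was passed via CURLOPT_WRITEDATA.
std::size_t append_to_buffer(char* ptr, std::size_t size, std::size_t nmemb, void* userdata) {
    auto* buffer = static_cast<std::string*>(userdata);
    const std::size_t bytes = size * nmemb;  // size of this chunk in bytes
    buffer->append(ptr, bytes);              // the chunk is not NUL-terminated
    return bytes;  // returning a different count tells libcurl the write failed
}
```

libcurl calls the callback once per received chunk, so after the transfer the string holds the whole report.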
I now want to "read through" the HTML of the reports and store a large number of values, basically anything in the financial or balance sheet tables. Each value sits near an identifying substring (a label) in the document, and these labels vary in length.
Which (if any) of the following would be applicable here?
Boost.Regex - search for a set of expressions and, on each match, store the next value found after it
Libxml++ (or some equivalent) - form a DOM tree and write a method which traverses its nodes, storing data when a node is of a certain type or contains a certain string ("Net Revenue", for example).
Or can you suggest some other library or methodology with the capability I'm looking for?
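To make the first option concrete, here is a minimal sketch of "find the label, then store the next value". It uses std::regex as a stand-in for Boost.Regex so that it is self-contained. The function name `value_after_label` and the number pattern are assumptions for illustration, not a tested approach for real SEC filings.

```cpp
#include <regex>
#include <string>

// Returns the first number that appears as cell text after `label`,
// or an empty string if the label or a following number is not found.
std::string value_after_label(const std::string& html, const std::string& label) {
    const std::string::size_type pos = html.find(label);
    if (pos == std::string::npos) return "";

    // A number counts only if it directly follows a tag's closing '>',
    // optionally preceded by whitespace, '$' or '('. This skips digits
    // inside attributes such as colspan="2".
    static const std::regex number(R"(>\s*\$?\s*\(?(\d[\d,]*(?:\.\d+)?))");

    std::smatch m;
    const auto start = html.cbegin() + static_cast<std::string::difference_type>(pos + label.size());
    if (std::regex_search(start, html.cend(), m, number)) {
        return m[1].str();  // digits, thousands separators and decimals only
    }
    return "";
}
```

For example, given the row `<tr><td>Net Revenue</td><td colspan="2">$ 1,234.5</td></tr>` and the label "Net Revenue", this returns "1,234.5". A pattern like this depends heavily on how each filing lays out its tables, which is the main reason to weigh it against the DOM approach.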