
I have a program that fetches HTML, specifically quarterly reports from SEC.gov, and holds it in memory using libcurl's CURLOPT_WRITEFUNCTION callback.
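For context, a minimal sketch of the kind of write callback I mean, assuming the pointer registered with CURLOPT_WRITEDATA is a `std::string*`. The name `append_to_string` is only illustrative, not something from libcurl:

```cpp
#include <cstddef>
#include <string>

// Has the signature libcurl expects for a CURLOPT_WRITEFUNCTION callback.
// userdata is whatever pointer was registered with CURLOPT_WRITEDATA;
// this sketch assumes it points to a std::string.
std::size_t append_to_string(char* data, std::size_t size,
                             std::size_t nmemb, void* userdata) {
    const std::size_t bytes = size * nmemb;
    auto* buffer = static_cast<std::string*>(userdata);
    buffer->append(data, bytes);
    // Returning a count different from `bytes` tells libcurl the write failed.
    return bytes;
}
```

It would be registered with `curl_easy_setopt(handle, CURLOPT_WRITEFUNCTION, append_to_string)` and `curl_easy_setopt(handle, CURLOPT_WRITEDATA, &buffer)`, after which `buffer` holds the whole report once the transfer finishes.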

I now want to "read through" the HTML of the reports and store many (many) values, basically anything in the financial or balance sheet tables. Each value would be identified by a label substring of varying length somewhere in the document.

Which (if any) of the following would be applicable here:

Boost.Regex - search for a set of expressions and store the next value found after each match
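A rough sketch of what I have in mind for this option, using `std::regex` as a stand-in for Boost.Regex (the interfaces are similar). The function name `value_after_label` and the number pattern are my own assumptions, and the label is assumed to contain no regex metacharacters:

```cpp
#include <regex>
#include <string>

// Returns the first number-like token that follows `label` in `html`,
// skipping whitespace and tags in between. Returns an empty string when
// nothing matches. Assumes `label` contains no regex metacharacters.
std::string value_after_label(const std::string& html,
                              const std::string& label) {
    const std::regex pattern(
        label + R"((?:\s|<[^>]*>)*\$?([0-9][0-9,]*(?:\.[0-9]+)?))");
    std::smatch match;
    if (std::regex_search(html, match, pattern)) {
        return match[1].str();  // the captured number, without any "$"
    }
    return std::string();
}
```

This only works while the markup between the label and the value stays as simple as the pattern assumes, so it is likely to be brittle against real filings.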

libxml++ (or some equivalent) - build a DOM tree and write a method that traverses its nodes, storing data when a node is of a certain type or contains a certain string ("Net Revenue", for example).
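A sketch of the traversal I am picturing, independent of any particular parser. `Node` here is a hypothetical stand-in for whatever node type the library provides, and `collect` is an illustrative name; it assumes each table row holds the label in its first cell and the value in its second:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed HTML node; a real parser has its own type.
struct Node {
    std::string tag;             // element name, e.g. "tr" or "td"
    std::string text;            // text directly inside the element
    std::vector<Node> children;  // child elements in document order
};

// Depth-first walk: for every row whose first cell contains a wanted label,
// store the text of the second cell under that label.
void collect(const Node& node, const std::vector<std::string>& labels,
             std::map<std::string, std::string>& out) {
    if (node.tag == "tr" && node.children.size() >= 2) {
        for (const std::string& label : labels) {
            if (node.children[0].text.find(label) != std::string::npos) {
                out[label] = node.children[1].text;
            }
        }
    }
    for (const Node& child : node.children) {
        collect(child, labels, out);
    }
}
```

Calling `collect(root, labels, results)` on the document root would fill `results` with one entry per label found.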

Or can you suggest some other library or methodology with the capability I'm looking for?

A-Sharabiani
  • You shall not use regular expressions. XML libraries can't parse HTML either, unless you first convert it to XHTML with a tool such as HTML Tidy. – Yakov Galka Jan 06 '17 at 20:09
  • [Thou shall not use regex to parse HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) – NathanOliver Jan 06 '17 at 20:11
  • The standard C++ language has no facilities for parsing HTML. You'll have to visit [softwarerecs.se] or search there for an **HTML** library. – Thomas Matthews Jan 06 '17 at 20:21
  • Or check if the SEC has another API that can be used to query the desired data in a machine-parsable format, like XML, JSON, etc. Scraping HTML websites should be a *last* resort. – Remy Lebeau Jan 06 '17 at 20:31
