I have several thousand HTML files and, for the ultimate purpose of running a word-frequency counter, I am only interested in one particular portion of each file. For example, suppose the following is part of one of the files:
<!-- Lots of HTML code up here -->
<div class="preview_content clearfix module_panel">
<div class="textelement "><div><div><p><em>"Portion of interest"</em></p></div>
</div>
<!-- Lots of HTML code down here -->
How should I go about using regular expressions in C++ (boost::regex) to extract the portion of text marked "Portion of interest" in the example and put it into a separate string?
I currently have some code that opens the HTML file and reads the entire content into a single string, but when I run boost::regex_match looking for that particular opening line, <div class="preview_content clearfix module_panel">, I don't get any matches. I'm open to any suggestions as long as they're in C++.
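A minimal sketch of the situation described above (not the actual code). To stay self-contained it uses std::regex as a stand-in for boost::regex; the calls used here (regex_match, regex_search, smatch) have the same shape in both. The names kSampleHtml, matches_whole_input and extract_portion are hypothetical, and the pattern assumes the text of interest is the first <em>...</em> after the marker div. One thing worth noting: regex_match only succeeds when the pattern matches the entire input string, which would be consistent with getting no matches on a whole file, whereas regex_search looks for a match anywhere in the input.

```cpp
#include <regex>
#include <string>

// Hypothetical sample standing in for the contents of one file.
const std::string kSampleHtml =
    "<!-- Lots of HTML code up here -->\n"
    "<div class=\"preview_content clearfix module_panel\">\n"
    "<div class=\"textelement \"><div><div><p><em>\"Portion of interest\"</em></p></div>\n"
    "</div>\n"
    "<!-- Lots of HTML code down here -->\n";

// Roughly the attempt described: regex_match against the whole file contents.
// regex_match requires the pattern to cover the ENTIRE string, so this
// returns false whenever the file contains anything besides the marker line.
bool matches_whole_input(const std::string& html) {
    static const std::regex marker(
        "<div class=\"preview_content clearfix module_panel\">");
    return std::regex_match(html, marker);
}

// Alternative sketch: regex_search finds the pattern anywhere in the input.
// Assumption: the wanted text is the first <em>...</em> after the marker div.
// [\s\S]*? is a non-greedy "any character, including newlines".
std::string extract_portion(const std::string& html) {
    static const std::regex pattern(
        "<div class=\"preview_content clearfix module_panel\">"
        "[\\s\\S]*?<em>([\\s\\S]*?)</em>");
    std::smatch m;
    if (std::regex_search(html, m, pattern)) {
        return m[1].str();  // first capture group: the text between <em> and </em>
    }
    return std::string();   // empty string when the marker is not found
}
```

Calling extract_portion on each file's contents would give one string per file to feed into the word-frequency counter. Whether a regex is robust enough for all of the files depends on how uniform their markup is; that is an assumption of this sketch.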