Extract specific portion of HTML file using c++/boost::regex

Question

I have a series of thousands of HTML files and, for the ultimate purpose of running a word-frequency counter, I am only interested in a particular portion of each file. For example, suppose the following is part of one of the files:

<!-- Lots of HTML code up here -->
<div class="preview_content clearfix module_panel">
      <div class="textelement   "><div><div><p><em>"Portion of interest"</em></p></div>
</div>
<!-- Lots of HTML code down here -->

How should I go about using regular expressions in c++ (boost::regex) to extract that particular portion of text highlighted in the example and put that into a separate string?

I currently have some code that opens the HTML file and reads the entire content into a single string, but when I try to run a boost::regex_match looking for that particular opening line, <div class="preview_content clearfix module_panel">, I don't get any matches. I'm open to any suggestions as long as they are in C++.

Obligatory reference: http://stackoverflow.com/a/1732454/1088 — aib, Oct 16 '12 at 00:46

score 1 · Answer 1 · answered Oct 16 '12 at 00:40

How should I go about using regular expressions in c++ (boost::regex) to extract that particular portion of text highlighted in the example and put that into a separate string?

You don't.

Never use regular expressions to process HTML. Whether in C++ with Boost.Regex, in Perl, Python, JavaScript, anything and anywhere. HTML is not a regular language; therefore, it cannot be processed in any meaningful way via regular expressions. Oh, in extremely limited cases, you might be able to get it to extract some particular information. But once those cases change, you'll find yourself unable to get done what you need to get done.

I would suggest using an actual HTML parser, like libxml2 (which does have the ability to read HTML4). Using regexes to parse HTML is simply using the wrong tool for the job.
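To illustrate the "extremely limited cases" caveat above, here is a minimal sketch, assuming std::regex (C++11) as a stand-in for boost::regex, whose regex_match and regex_search follow the same rules. regex_match succeeds only when the pattern matches the entire input, which may be why the attempt in the question found nothing; regex_search looks for the pattern anywhere in the string. The function name extractWithRegex is hypothetical, and the pattern assumes the markup looks exactly like the sample in the question.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: returns the text inside the first <em>...</em> that
// follows the preview_content div, or an empty string if nothing is found.
// Assumption: the class attribute is written exactly as in the sample markup.
std::string extractWithRegex(const std::string& html) {
    // [\s\S] matches any character including newlines; *? keeps the match lazy.
    static const std::regex pattern(
        R"rx(<div class="preview_content clearfix module_panel">[\s\S]*?<em>([\s\S]*?)</em>)rx");

    std::smatch m;
    // regex_search scans for the pattern anywhere in html;
    // regex_match would require the pattern to cover the whole file.
    if (std::regex_search(html, m, pattern)) {
        return m[1].str();
    }
    return std::string();
}
```

On the sample markup from the question this should return "Portion of interest" (quotes included). Any change to the markup, such as an extra attribute on the div or a nested tag inside the em element, can break it, which is the fragility described above.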

Yeah, after doing some more reading I now see that it seems like a bad idea to use regex when dealing with html. Thanks for pointing that out. Given that that is the only thing I want to do (i.e., get the content of that specific tag), what would you suggest I use? I've been looking around but most things I come across seem like a bit of an overkill. — Everaldo Aguiar, Oct 16 '12 at 02:06

score 1 · Accepted Answer · answered Oct 16 '12 at 03:43

Since all I needed was something quite simple (as per question above), I was able to get it done without using regex or any type of parsing. Following is the code snippet that did the trick:

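    // Needs <fstream>, <iterator>, and <string>
    // Read HTML file into string variable str
    std::ifstream t("/path/inputFile.html");
    std::string str((std::istreambuf_iterator<char>(t)), std::istreambuf_iterator<char>());

    // Find the two "flags" that enclose the content I'm trying to extract
    std::size_t pos1 = str.find("<div class=\"preview_content clearfix module_panel\">");
    // Look for the closing flag only after the opening one
    std::size_t pos2 = (pos1 == std::string::npos) ? pos1 : str.find("</em></p></div>", pos1);

    // Get that content and store into new string (left empty if a flag is missing).
    // Note: buf still starts with the opening <div ...> tag.
    std::string buf;
    if (pos2 != std::string::npos)
        buf = str.substr(pos1, pos2 - pos1);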

Thank you for pointing out the fact that I was totally on the wrong track.
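For reuse across thousands of files, the same find/substr idea can be wrapped in small functions. This is only a sketch under assumptions: the names extractBetween and stripTags are hypothetical, the markers are passed in by the caller, and the tag stripping is deliberately naive (it ignores comments, scripts, and character entities), so it is not a substitute for an HTML parser.

```cpp
#include <string>

// Hypothetical helper: returns the text strictly between the first `open`
// marker and the next `close` marker after it, or "" if either is missing.
std::string extractBetween(const std::string& text,
                           const std::string& open,
                           const std::string& close) {
    const std::size_t start = text.find(open);
    if (start == std::string::npos) return "";

    // Skip past the opening marker so it is not part of the result.
    const std::size_t contentStart = start + open.size();
    const std::size_t end = text.find(close, contentStart);
    if (end == std::string::npos) return "";

    return text.substr(contentStart, end - contentStart);
}

// Hypothetical helper: drops everything from '<' through the next '>' so
// that only the visible text is left for a word-frequency counter.
std::string stripTags(const std::string& fragment) {
    std::string out;
    bool inTag = false;
    for (char c : fragment) {
        if (c == '<') {
            inTag = true;
        } else if (c == '>') {
            inTag = false;
        } else if (!inTag) {
            out += c;
        }
    }
    return out;
}
```

A call such as stripTags(extractBetween(str, "<div class=\"preview_content clearfix module_panel\">", "</em></p></div>")) would then leave only the text of interest plus surrounding whitespace, with an empty result signalling that a marker was not found.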
