Parse simple html with pure C++

Question

In my application I need to parse simple HTML code while using as few external libraries as possible. My HTML looks like this:

<p> First Content is P </p><h2>Header</h2><p> Text under header </p>
<h2>Header 2</h2><p> Paragraph </p>
<h3>yep</h3><p> no </p>

My HTML contains only the tags p, h2, and h3. I have the following structure:

struct Elements {
    std::string tag;
    std::string content;
};

std::vector<Elements> elems;

So my goal is that, after parsing, each Elements in the vector contains data like this:

tag = "h2"
content = "Header"

and

tag = "p"
content = "First Content is P"

PS: I need to get the elements in the order they appear in the HTML.

Edit:

I just did this in JavaScript and it's working fine, but I have basically no idea how to write it in C++:

var a = "<p> First Content is P </p><h2>Header</h2><p> Text under header </p>" +
    "<h2>Header 2</h2><p> Paragraph </p>" +
    "<h3>yep</h3><p> no </p>";

var output = [];

a.replace(/<\b[^>]*>(.*?)<\/(.*?)>/gmi, function(m, key, value) {
    output.push({
        tag: value,
        data: key
    });
})

/*
    output:
        { tag: "p", data: "First Content is P"},
        { tag: "h2", data: "Header" }
        .....
 */

Do you mean HTML or XHTML? HTML allows for tags to have no closing element etc... — rhughes, Apr 15 '14 at 00:37
Two things you nearly never author yourself: Crypto, and HTML/XML parsers. [You may find this interesting](http://stackoverflow.com/questions/2912165/fast-lightweight-html-parser-for-c) — WhozCraig, Apr 15 '14 at 00:39
All of my elements have open and closing tag. Since I got only 3 tags and I can hardcode them and I thought can this be realised via some kind of Regex? — Deepsy, Apr 15 '14 at 00:41
Where does one get such simple and limited HTML from in the first place? — Matti Virkkunen, Apr 15 '14 at 00:42
Only try to parse HTML with regex if you *know for certain* that the HTML document will *never ever* get any bigger or more complex. [You have been warned.](http://stackoverflow.com/a/1732454/420683) — dyp, Apr 15 '14 at 00:45
"I thought can this be realised via some kind of Regex?" - so you *do* know how to start... why don't you give that a go and tell us where you get stuck? (Obligatory echo of all warnings above). Seriously though, if you're not ready to do it yourself, you have to use a library. Whatever solution you can write yourself (anytime soon) is likely to be very fragile. — Tony Delroy, Apr 15 '14 at 01:15
yes regex is part of standard c++ library and this can be used to parse and get your work done. — Mantosh Kumar, Apr 15 '14 at 01:37
Seems to me quite an over kill, you should really consider integrating a 3rd party library/parser. — 101010, Apr 15 '14 at 01:46
@MantoshKumar you are suggesting bad habits. I hope you did not put that in your book. Regex's should not be used for parsing HTML.. lol: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — Brandon, Apr 15 '14 at 02:04
@CantChooseUsernames: Bjarne Stroustrup has explained how regex in C++11 can be used to parse XML (tag/end-of-tag markers) in his TC++PL 4th Edition. You can check the answer by SteveJessop below, where he mentions uses of regex. I agree that there could be limits to regex when parsing these, but it is certainly not a bad habit to suggest. I hope you understand what I was trying to convey to SO. — Mantosh Kumar, Apr 15 '14 at 03:30
This is a perfectly legitimate mini project for learning to use C++. Google 'Lexer' and 'Parser' (which is overkill, but gives you an idea). If you happen to have Stroustrup's book, and editions haven't changed this, he used to have a reverse Polish notation simple arithmetic expression parser in chapter 6, which handles a more complicated parse, but gives you an idea (you won't need recursive descent from how you described your language though). — gnometorule, Apr 15 '14 at 04:00
I added a demo of what I'm trying to achieve, but in JavaScript. After reading http://www.cplusplus.com/reference/regex/ I couldn't work out how to parse the data into the vector. — Deepsy, Apr 15 '14 at 15:34

Steve Jessop · Accepted Answer · 2014-04-15T03:38:45.983

There are only those three elements, and no missing close tags. It looks as if, furthermore, there are no attributes on the tags, and there aren't even any elements inside elements. There's no whitespace inside tags either.

Then you are not parsing HTML. You are parsing a special language that is a subset of HTML (well, not even really a subset since your document doesn't validate).

You might have a good reason not to want to use an HTML parser to parse this special language. For example, the code for a full HTML parser is large-ish and perhaps wouldn't otherwise need to be on the very tiny embedded device you're writing for. More likely this is a learning assignment, and the goal is for you to manipulate strings, not to choose the best tool to produce the output you need. I will assume that you must avoid using an HTML library, without considering further why.

So, how to parse this special language? How to parse anything. Given all the restrictions I have listed above, you could do it very simply:

1. Look for the first instance in the string of any one of three substrings <p>, <h2>, <h3>. This is your opening tag.
2. Find the first instance of the corresponding close tag.
3. Everything between is the contents of the element. In your example you additionally trim whitespace at each end of the content. Construct an Elements object and add it to your vector (btw consider using a singular class name, not plural).
4. Repeat on the remainder of the string.
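
The steps above can be sketched in plain C++ roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes exactly the restrictions listed earlier (only p, h2 and h3, no attributes, no nesting, no whitespace inside tags), and the names Element, trim and parse are made up for illustration.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>

struct Element {          // hypothetical singular name for the asker's Elements
    std::string tag;      // "p", "h2" or "h3"
    std::string content;  // text between the tags, trimmed
};

// Strip leading and trailing whitespace from s.
std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const std::size_t first = s.find_first_not_of(ws);
    if (first == std::string::npos) return "";
    const std::size_t last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Assumes only <p>, <h2>, <h3>, no attributes and no nested elements.
std::vector<Element> parse(const std::string& html) {
    static const std::array<std::string, 3> tags = {"p", "h2", "h3"};
    std::vector<Element> elems;
    std::size_t pos = 0;
    while (true) {
        // Step 1: find the earliest opening tag at or after pos.
        std::size_t best = std::string::npos;
        std::string bestTag;
        for (const std::string& t : tags) {
            const std::size_t at = html.find("<" + t + ">", pos);
            if (at < best) { best = at; bestTag = t; }
        }
        if (best == std::string::npos) break;  // no more elements

        // Step 2: find the corresponding close tag.
        const std::size_t start = best + bestTag.size() + 2;  // skip "<tag>"
        const std::size_t end = html.find("</" + bestTag + ">", start);
        if (end == std::string::npos) break;  // malformed input: stop

        // Step 3: everything in between is the content.
        elems.push_back({bestTag, trim(html.substr(start, end - start))});

        // Step 4: repeat on the remainder of the string.
        pos = end + bestTag.size() + 3;  // skip "</tag>"
    }
    return elems;
}
```

Calling parse on the HTML from the question should yield seven elements in document order, starting with tag "p" and content "First Content is P". Text that sits between elements (such as the newlines) is skipped silently.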

That's it. You could do that using a regular expression, but my general feeling is that since you said you wanted to do it in C++ then you may as well just do it in C++. No need to bring another language into it, and whatever the merits and limits of regexes, they certainly are another language.
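
For comparison, here is roughly what the regular-expression route could look like with std::regex from C++11. It is a sketch under the same restrictions, and the names Element and parseWithRegex are hypothetical. The pattern uses a backreference (\1) so that the close tag must match the open tag, and it consumes surrounding whitespace outside the capture group so the content comes out trimmed.

```cpp
#include <regex>
#include <string>
#include <vector>

struct Element {          // hypothetical singular name for the asker's Elements
    std::string tag;
    std::string content;
};

// Assumes only <p>, <h2>, <h3>, no attributes, no nesting, and content
// that does not span lines ('.' does not match a newline in the default
// ECMAScript grammar).
std::vector<Element> parseWithRegex(const std::string& html) {
    // Group 1: tag name. Group 2: content, matched non-greedily.
    static const std::regex re(R"(<(p|h2|h3)>\s*(.*?)\s*</\1>)");
    std::vector<Element> elems;
    // sregex_iterator visits the non-overlapping matches in document order.
    for (std::sregex_iterator it(html.begin(), html.end(), re), end;
         it != end; ++it) {
        elems.push_back({(*it)[1].str(), (*it)[2].str()});
    }
    return elems;
}
```

This mirrors the JavaScript replace callback from the question: each match plays the role of one callback invocation, with the two capture groups giving the tag and the data.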

However, maybe the extra limits I listed above aren't guaranteed. What if you later want to support spaces inside tags? And attributes? And XML namespaces? And comments? Then you'll wish you'd just used an HTML parser. Therefore what you do for a fixed trivial subset of HTML is different from what you do for a significant subset or one that might become significant in future.

Sergey Prokhorov · Answer 2 · 2014-04-15T05:29:04.187

Just a suggestion. To speed up the parser, change struct Elements to something like:

struct Node {
    const char* ptrToNodeStart;  // points into the document buffer, no copy
    int nodeLen;                 // length of the node's text
    // Node() ... etc
};

struct Elements {
    Node tag;
    Node content;
};

The main idea is to avoid allocating memory for tags and content, because you already have the whole document in memory. Just keep it there and work with pointers into it, which is much faster: with pointers, the parsing procedure can finish before a single allocation would have completed. As your parser runs through the document, it creates a new Node (taken from a preallocated pool) and stores the current pointer in Node::ptrToNodeStart. When a new node starts (or the current one is closed), you set Node::nodeLen and complete the Elements entry. That is the idea.

A serious problem with struct Elements is that it does not fit the structure of HTML: an HTML node normally contains other nodes, so Elements would have to be nested. Parsing HTML is an interesting task even though there are tons of parsers already on the market. Good luck.
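
In C++17, the pointer-plus-length pair described above is roughly what std::string_view already provides, so the idea might be sketched as below. This swaps std::string_view in for the hand-written Node, which is an assumption and not what the answer itself proposes; the names ElementView, TagSpec, trimView and parseViews are made up for illustration. The returned views point into the original buffer and are only valid while that buffer is alive, and the vector itself still allocates (a caller could reserve capacity up front in place of a preallocated pool).

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <vector>

struct ElementView {           // hypothetical non-owning counterpart of Elements
    std::string_view tag;      // view into the document, e.g. "h2"
    std::string_view content;  // view into the document, trimmed
};

struct TagSpec {               // hypothetical helper: one open/close pair
    std::string_view open;
    std::string_view close;
};

// Narrow the view to exclude leading and trailing whitespace (no copy).
std::string_view trimView(std::string_view s) {
    const std::size_t first = s.find_first_not_of(" \t\r\n");
    if (first == std::string_view::npos) return {};
    const std::size_t last = s.find_last_not_of(" \t\r\n");
    return s.substr(first, last - first + 1);
}

// Assumes only <p>, <h2>, <h3>, no attributes and no nested elements.
std::vector<ElementView> parseViews(std::string_view html) {
    static const std::array<TagSpec, 3> specs = {{
        {"<p>", "</p>"}, {"<h2>", "</h2>"}, {"<h3>", "</h3>"}}};
    std::vector<ElementView> elems;
    std::size_t pos = 0;
    while (true) {
        // Find the earliest opening tag at or after pos.
        const TagSpec* best = nullptr;
        std::size_t bestAt = std::string_view::npos;
        for (const TagSpec& s : specs) {
            const std::size_t at = html.find(s.open, pos);
            if (at < bestAt) { bestAt = at; best = &s; }
        }
        if (best == nullptr) break;

        const std::size_t start = bestAt + best->open.size();
        const std::size_t end = html.find(best->close, start);
        if (end == std::string_view::npos) break;  // malformed input: stop

        // Both views refer to bytes of html itself; nothing is copied.
        elems.push_back({html.substr(bestAt + 1, best->open.size() - 2),
                         trimView(html.substr(start, end - start))});
        pos = end + best->close.size();
    }
    return elems;
}
```

Whether this is measurably faster than copying into std::string depends on the input size and the standard library, so it is worth measuring before committing to the non-owning design.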

In my HTML code I don't have nested tags. Thanks for the suggestion tho, I will try to do it now. — Deepsy, Apr 15 '14 at 08:36
