I have a partial HTML string and, given the position of an opening tag, I would like to find the position of the matching closing tag. I can't use an HTML parser (at least I don't think I can) because the HTML is just a snippet, not a complete document. There may be mismatched tags before or after the part I'm looking at. The string does not include the DTD, html, head or body tags.
For example:
<div id='something' class='someclass'>
<h1>Title</h1>
<div><p>some text</p></div>
<div>
<div class='anotherdiv'>
</div>
<div class='yetanother'>
</div>
</div>
</div>
(Position numbers are zero-based offsets of the < at the start of a particular tag.)
Given a position of 0 (beginning of the string), I would want to get the content:
<h1>Title</h1>
<div><p>some text</p></div>
<div>
<div class='anotherdiv'>
</div>
<div class='yetanother'>
</div>
</div>
Given a position of 39 (beginning of h1 on second line), I would want to get the content:
Title
Given a position of 82 (beginning of the div on line 4), I would want to get the content:
<div class='anotherdiv'>
</div>
<div class='yetanother'>
</div>
I've tried several methods so far. First, I used strpos to locate a matching closing tag, then looked to see if there was another opening tag between the starting point and the closing tag. If found, I look for the next matching closing tag. Quite messy.
I then tried searching for the next matching opening tag (tag name with a "<" in front), then checking to see if there was a closing tag in between. Also quite messy.
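For reference, both of these attempts boil down to a depth counter over tags that share the starting tag's name. Here is a minimal sketch of that idea, written in Python purely for illustration. The function name matching_close is hypothetical, and it assumes tags of the same name are balanced inside the element, which a mismatched snippet may violate:

```python
import re


def matching_close(html, start):
    """Return the offset of the closing tag that matches the opening tag
    at `start`, or None if no match is found."""
    opening = re.compile(r"<([A-Za-z][A-Za-z0-9]*)").match(html, start)
    if opening is None:
        raise ValueError("start must point at an opening tag")
    name = opening.group(1)
    # Only look at opening and closing tags with the same name.
    same = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(name), re.IGNORECASE)
    depth = 0
    for m in same.finditer(html, start):
        # The first match is the starting tag itself, so depth begins at 1.
        depth += -1 if m.group(1) else 1
        if depth == 0:
            return m.start()
    return None
```

This only tracks one tag name, so it cannot tell that a different tag was left unclosed in between. That may or may not matter for a given snippet.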
Lastly, I started with the tag at the specified position, and built a list (stack) of opening and closing tags -- pushing the tag name on an opening tag and popping the tag name (if it matches) on a matching closing tag until the stack had one item matching the starting tag. With each operation, I keep track of the position so I end up with the start position (character following the > in the start tag), and the end position (the character before the closing tag's < character).
It ignores mismatched closing tags. For example, if there's an opening p tag, then an opening b tag, then it finds the closing /p tag without a closing b tag, it drops the b tag from the list. Similarly, if it finds a closing tag that isn't in the stack, it ignores it. Example:
<p><b>some text</p></b>
Both the <b> and </b> are ignored.
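To make the stack idea concrete, here is a minimal sketch in Python, for illustration only. The names inner_span, TAG_RE and VOID_TAGS are hypothetical. The regular expression is a naive tag matcher that assumes no ">" appears inside attribute values, and the void-tag list is an assumption rather than a complete set:

```python
import re

# Matches an opening or closing tag: group 1 is "/" for a closing tag,
# group 2 is the tag name. Naive: assumes no ">" inside attribute values.
TAG_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")

# Assumption: these never have a closing tag, so they are never pushed.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


def inner_span(html, start):
    """Return (begin, end) offsets of the content of the tag that opens at
    `start`, or None if its closing tag is never found."""
    first = TAG_RE.match(html, start)
    if first is None or first.group(1) == "/":
        raise ValueError("start must point at an opening tag")
    stack = [first.group(2).lower()]
    begin = first.end()  # character following the ">" of the start tag
    for m in TAG_RE.finditer(html, begin):
        name = m.group(2).lower()
        if m.group(1) != "/":
            # Opening tag: push unless it is void or self-closing.
            if name not in VOID_TAGS and not m.group(0).endswith("/>"):
                stack.append(name)
        elif name in stack:
            # Closing tag: drop any unclosed tags above it, then the tag itself.
            while stack.pop() != name:
                pass
            if not stack:
                return begin, m.start()
        # A closing tag that is not on the stack is simply ignored.
    return None
```

With the mismatched example above, inner_span("<p><b>some text</p></b>", 0) returns (3, 15), which is the span covering "<b>some text". The unclosed <b> is dropped from the stack when </p> is reached, and the trailing </b> is never examined.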
This last method seems to be the best idea, but I'm wondering if anyone else has a better idea.
I'm not looking for someone to write the code. I can do that. I'm looking for a concept/idea to use. If my last idea above is the best, I'd love to hear that too.
If it's a bad idea, or I'm way out in left field, I want to hear that too, but would appreciate if you can explain why and offer a better, more sane way to do it.
I guess what I'm really looking for is a "reality" check to be sure I'm not overcomplicating the solution.
Thanks in advance!
Sloan