2

I have a partial HTML string and, given the position of an opening tag, I would like to find the position of the matching closing tag. I can't use an HTML parser (at least I don't think I can) because the HTML is just a snippet, not a complete document. There may be mismatched tags before or after the part I'm looking at. The string does not include the doctype, html, head or body tags.

For example:

<div id='something' class='someclass'>
  <h1>Title</h1>
  <div><p>some text</p></div>
  <div>
    <div class='anotherdiv'>
    </div>
    <div class='yetanother'>
    </div>
  </div>
</div>

(Position numbers refer to the < at the start of a particular tag.)
Given a position of 0 (beginning of the string), I would want to get the content:

  <h1>Title</h1>
  <div><p>some text</p></div>
  <div>
    <div class='anotherdiv'>
    </div>
    <div class='yetanother'>
    </div>
  </div>

Given a position of 39 (beginning of h1 on second line), I would want to get the content:

Title

Given a position of 83 (beginning of the div on line 4), I would want to get the content:

    <div class='anotherdiv'>
    </div>
    <div class='yetanother'>
    </div>

I've tried several methods so far. First, I've used strpos to locate a matching closing tag, then looked to see if there was another opening tag between the starting point and the closing tag. If found, I look for the next matching closing tag. Quite messy.

I then tried searching for the next matching opening tag (tag name with a "<" in front), then checking to see if there was a closing tag in between. Also quite messy.
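
One way to tidy up the two attempts above is to keep a single depth counter over tags that share the starting tag's name, since tags of other names cannot change where the matching close is. The following is a minimal sketch in Python rather than PHP, under stated assumptions: `inner_by_name` is a hypothetical helper, attribute values are assumed not to contain a literal `>`, and tag names are assumed to be plain alphanumerics.

```python
import re

def inner_by_name(html, pos):
    """Return the text between the tag opening at `pos` and its matching close.

    Only tags with the same name as the starting tag are counted, so
    mismatched tags of other names cannot disturb the depth.
    Returns None if the fragment has no matching closing tag.
    """
    # Assumption: no '>' inside attribute values.
    opening = re.compile(r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>").match(html, pos)
    if opening is None:
        raise ValueError("no opening tag at position %d" % pos)
    name = opening.group(1)

    # Matches an opening or a closing tag with the same name.
    same = re.compile(r"<(/?)%s\b[^>]*>" % re.escape(name), re.IGNORECASE)

    depth = 1
    for tag in same.finditer(html, opening.end()):
        if tag.group(1):                       # closing tag
            depth -= 1
            if depth == 0:
                return html[opening.end():tag.start()]
        elif not tag.group(0).endswith("/>"):  # skip self-closing tags
            depth += 1
    return None
```

For `<p><b>some text</p></b>` and position 0 this returns `<b>some text`, because the stray `<b>` never enters the count.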

Lastly, I started with the tag at the specified position, and built a list (stack) of opening and closing tags -- pushing the tag name on an opening tag and popping the tag name (if it matches) on a matching closing tag until the stack had one item matching the starting tag. With each operation, I keep track of the position so I end up with the start position (character following the > in the start tag), and the end position (the character before the closing tag's < character).

It ignores mismatched closing tags. For example, if there's an opening p tag, then an opening b tag, then it finds the closing /p tag without a closing b tag, it drops the b tag from the list. Similarly, if it finds a closing tag that isn't in the stack, it ignores it. Example:

<p><b>some text</p></b>

Both the <b> and </b> are ignored.
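
The stack idea described above can be sketched as follows. This is an illustration in Python, not the asker's actual code: `TAG` and `inner_by_stack` are hypothetical names, the regular expression is a simplification that assumes no `>` inside attribute values, and comments, scripts and void elements such as `<br>` written without a slash are not handled.

```python
import re

# Groups: (1) "/" for a closing tag, (2) tag name, (3) "/" for a self-closing tag.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*?(/?)>")

def inner_by_stack(html, pos):
    """Return the content of the element whose opening tag starts at `pos`.

    Returns None if the fragment ends before the element is closed.
    """
    stack = []
    content_start = None
    for tag in TAG.finditer(html, pos):
        closing = bool(tag.group(1))
        name = tag.group(2).lower()
        self_closing = bool(tag.group(3))

        if content_start is None:
            # The first tag found must be the opening tag at `pos`.
            if tag.start() != pos or closing:
                raise ValueError("no opening tag at position %d" % pos)
            if self_closing:
                return ""
            stack.append(name)
            content_start = tag.end()
            continue

        if not closing:
            if not self_closing:
                stack.append(name)
            continue

        if name not in stack:
            continue  # stray closing tag: ignore it

        # Drop any unclosed tags above the match, then the match itself.
        while stack.pop() != name:
            pass
        if not stack:
            return html[content_start:tag.start()]
    return None
```

With the example above, `inner_by_stack("<p><b>some text</p></b>", 0)` yields `<b>some text`: the unclosed `b` is dropped when `</p>` arrives, and the later `</b>` is never reached.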

This last method seems to be the best idea, but I'm wondering if anyone else has a better idea.

I'm not looking for someone to write the code. I can do that. I'm looking for a concept/idea to use. If my last idea above is the best, I'd love to hear that too.

If it's a bad idea, or I'm way out in left field, I want to hear that too, but would appreciate if you can explain why and offer a better, more sane way to do it.

I guess what I'm really looking for a "reality" check to be sure I'm not over complicating the solution.

Thanks in advance!

Sloan

RST
Sloan Thrasher
  • Maybe [this question](http://stackoverflow.com/questions/701166/can-you-provide-some-examples-of-why-it-is-hard-to-parse-xml-and-html-with-a-reg) provides some insight... – andy Sep 20 '14 at 21:37
  • Most actual parsers can be configured to 'recover' from 'bad' HTML and/or XML (it being a fragment). I'd see if that is the case for you first. If that's not possible, you can use the basic [`XML Parser`](http://php.net/manual/en/book.xml.php#book.xml), which is a bit more verbose in its use but cares little about incomplete documents / fragments. – Wrikken Sep 20 '14 at 21:57
  • It's almost certainly the wrong approach. What are you really trying to do? – pguardiario Sep 20 '14 at 22:18
  • Given the position within a string that points to the start of a tag, I want the string that falls between the start of that tag and its matching end tag. See the examples above. The tag may or may not be unique, and it may or may not have an ID or any other attributes. The position in the string is provided by a user clicking on a particular location in the string. It is not located using a CSS selector or other similar method. – Sloan Thrasher Sep 21 '14 at 10:29
  • Wrikken -- So, for instance, if I stripped all of the characters before the pointer, so the string begins with the tag of interest, and I pass it to a parser, will I be able to reliably retrieve the string enclosed by the tag? Thanks! – Sloan Thrasher Sep 21 '14 at 10:31
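
To illustrate the idea raised in the comments (strip everything before the pointer and hand the rest to a tolerant parser), here is a minimal sketch using Python's standard `html.parser` module instead of PHP's XML Parser. `InnerFinder` and `inner_with_parser` are hypothetical names. The sketch only counts depth, so it assumes the tags inside the element of interest are balanced; trailing junk after the matching close is ignored.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class InnerFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0
        self.content_start = None  # offset just past the first opening tag
        self.end_pos = None        # (line, column) of the matching closing tag

    def handle_starttag(self, tag, attrs):
        if self.end_pos is not None or tag in VOID:
            return
        if self.content_start is None:
            # The fragment starts with this tag, so its length is the offset.
            self.content_start = len(self.get_starttag_text())
        self.depth += 1

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tag: depth is unchanged

    def handle_endtag(self, tag):
        if self.end_pos is not None or self.depth == 0 or tag in VOID:
            return
        self.depth -= 1
        if self.depth == 0:
            self.end_pos = self.getpos()

def inner_with_parser(html, pos):
    """Return the content of the element starting at `pos`, or None."""
    fragment = html[pos:]
    finder = InnerFinder()
    finder.feed(fragment)
    finder.close()
    if finder.content_start is None or finder.end_pos is None:
        return None
    line, col = finder.end_pos
    # Convert the 1-based line and 0-based column into a string offset.
    lines = fragment.split("\n")
    end = sum(len(text) + 1 for text in lines[:line - 1]) + col
    return fragment[finder.content_start:end]
```

The parser does not complain about the missing doctype, html, head or body tags, which is the property the comment relies on.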

2 Answers

0

What about scanning the whole string character by character, like this:

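Assuming the string is named s and start is the position of the opening tag:

#include <string>

std::string innerContent(const std::string& s, std::size_t start)
{
  int depth = 0;                 // number of tags currently open
  char quote = 0;                // quote character we are inside, or 0
  bool inTag = false;            // true between '<' and '>'
  bool closing = false;          // true if the current tag starts with "</"
  bool started = false;          // true once the starting tag has been read
  std::size_t contentStart = 0;  // first character after the starting tag
  std::size_t lastOpeningBracketPosition = 0;

  for (std::size_t i = start; i < s.size(); i++)
  {
    char c = s[i];
    if (!inTag)
    {
      if (c == '<')
      {
        // this character interests us: a tag begins here
        inTag = true;
        lastOpeningBracketPosition = i;
        closing = (i + 1 < s.size() && s[i + 1] == '/');
      }
      continue;
    }

    // quotes only matter inside a tag, where they delimit attribute values
    if (quote != 0)
    {
      if (c == quote)
        quote = 0;
      continue;
    }
    if (c == '"' || c == '\'')
    {
      quote = c;
      continue;
    }
    if (c != '>')
      continue;

    // the tag ends here
    inTag = false;
    bool selfClosing = (s[i - 1] == '/');
    if (closing)
      depth--;
    else if (!selfClosing)
      depth++;

    if (!started)
    {
      started = true;
      contentStart = i + 1;
    }
    if (depth <= 0)
    {
      // this tag balances the starting tag
      if (lastOpeningBracketPosition < contentStart)
        return ""; // the starting tag was self-closing
      return s.substr(contentStart, lastOpeningBracketPosition - contentStart);
    }
  }
  return ""; // no matching closing tag found
}

I haven't compiled this code, but the philosophy is there.

You scan character by character, looking for < and > only outside quoted attribute values. You increase a counter each time an opening tag ends and decrease it each time a closing tag ends; self-closing tags leave it unchanged. You save the position of the last < and the end of the starting tag to extract the interesting part. (TODO: handle void elements such as <br>, comments, and mismatched tags.)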

lgm42
0

I solved my issue by writing a pseudo parser. It is really basic and starts with the tag at an indicated position. It steps through the string, identifying each tag and closing tag. It also watches for a self-closing tag (ie. ). For each opening tag, it pushes it onto a stack, and for each closing tag, if it matches the last opening tag, it pops it off of the stack. When it pops the last matching tag off of the stack, it has found the matching closing tag for the starting tag.

As it works, it keeps track of the end of an opening tag and the start of the matching closing tag. This allows it to know the starting position and ending position of the string contained by the starting tag. I added some "smarts" to detect and handle mismatched tags, but overall, it's as simple as described.

I'm using this to parse info on web pages, to locate and capture specific data. For example, I used it to convert a table of data into database records as part of a project to convert hand-entered html tables to database table records. It seems reasonably fast, parsing just over 10k records of 12 columns and inserting the data into a table in less than 0.1 seconds.

I chose this method over using a full HTML or XML parser because the starting position in many cases was based on an element following another element, rather than something a CSS selector could pick out. Determining the starting position would have been more difficult using a CSS selector for the particular HTML involved, but was easy to do using strpos with a known starting point to skip over some of the HTML that would have matched a selector for the desired element.
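
As a rough illustration of locating the starting position by skipping past a known landmark, the sketch below uses Python's `str.find`, which plays the same role as PHP's `strpos` with an offset. `find_tag_after` is a hypothetical helper, and the sample HTML is made up for the example.

```python
def find_tag_after(html, anchor, tag, start=0):
    """Return the position of the first `<tag` that follows `anchor`, or -1.

    Assumption: a plain substring search is good enough, so `<table`
    would also match a longer tag name that begins with "table".
    """
    at = html.find(anchor, start)
    if at < 0:
        return -1
    # Search only after the anchor, skipping earlier tags of the same name.
    return html.find("<" + tag, at + len(anchor))


page = ("<table id='nav'></table>"
        "<h2>Results</h2>"
        "<table><tr><td>1</td></tr></table>")
pos = find_tag_after(page, "<h2>Results</h2>", "table")
```

Here `pos` points at the second table, the one that follows the heading, even though the first table would also match a plain `table` selector.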

Sloan Thrasher