How to get everything between two HTML tags? (with XPath?)

Question

EDIT : I've added a solution which works in this case.

I want to extract a table from a page and I want to do this (probably) with a DOMDocument and XPath. But if you've got a better idea, tell me.

My first attempt was this (obviously faulty, because it will get the first closing table tag):

<?php 
    $tableStart = strpos($source, '<table class="schedule"');
    $tableEnd   = strpos($source, '</table>', $tableStart);
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));
?>

I thought this might be solvable with a DOMDocument and/or XPath...

In the end I want everything between the tags (in this case, the <table> tags), and the tags themselves. So all the HTML, not just the values (e.g. not just 'Value' but 'Value' together with its surrounding tags). And there is one catch...

The table has other tables inside it. So if you just search for the end of the table (the '</table>' tag), you will probably get the wrong tag.
The opening tag has a class with which you can identify it (classname = 'schedule').

Is this possible?

This is the (simplified) source piece that I want to extract from another website: (I also want to display the html tags, not just the values, so the whole table with the class 'schedule')

<table class="schedule">
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- a problematic tag...

    This could even be variable content. =O =S

</table>

Yes, use DOMDocument, like the splitting / merging XML files example here http://stackoverflow.com/questions/8602503/copy-xml-attributes-php/8606578#8606578 — William Walseth, Jan 21 '12 at 04:19
Use an XPath statement like "//table[@class='schedule']" or "//table[3]". — William Walseth, Jan 21 '12 at 04:20
And then? Could you please give an example? Because I just can't figure it out :S I've been trying and looking the whole night now... — SuperSpy, Jan 21 '12 at 04:21
I don't see the string "schedule" anywhere in the html you provided. What exactly is the desired output you want? You are using terms imprecisely ("tag", "element", "the html not the values", etc), so we are having trouble understanding your question. — Francis Avila, Jan 21 '12 at 18:29
@FrancisAvila: I've modified my question. Keep in mind that I am Dutch and not an expert in php. Oow, and also take a look at my solution :) — SuperSpy, Jan 22 '12 at 11:15

Dimitre Novatchev · Answer 1 · 2012-01-21T18:48:12.897

First of all, do note that XPath is based on the XML Infoset -- a model of XML where there are no "starting tag" and "ending tag" but there are only nodes.

Therefore, one shouldn't expect an XPath expression to select "tags" -- it selects nodes.

Taking this fact into account, I interpret the question as:

I want to obtain the set of all elements that are between a given "start" element and a given "end" element, including the start and end elements.

In XPath 2.0 this can be done conveniently with the standard operator intersect.

In XPath 1.0 (which I assume you are using) this is not so easy. The solution is to use the Kayessian (by @Michael Kay) formula for node-set intersection:

The intersection of two node-sets: $ns1 and $ns2 is selected by evaluating the following XPath expression:

$ns1[count(.|$ns2) = count($ns2)]

Let's assume that we have the following XML document (as you never provided one):

<html>
    <body>
        <table>
            <tr valign="top">
                <td>
                    <table class="target">
                        <tr>
                            <td>Other Node</td>
                            <td>Other Node</td>
                            <td>Starting Node</td>
                            <td>Inner Node</td>
                            <td>Inner Node</td>
                            <td>Inner Node</td>
                            <td>Ending Node</td>
                            <td>Other Node</td>
                            <td>Other Node</td>
                            <td>Other Node</td>
                        </tr>
                    </table>
                </td>
            </tr>
        </table>
    </body>
</html>

The start-element is selected by:

//table[@class = 'target']
         //td[. = 'Starting Node']

The end-element is selected by:

//table[@class = 'target']
         //td[. = 'Ending Node']

To obtain all wanted nodes we intersect the following two sets:

The set consisting of the start element and all following elements (we name this $vFollowing).
The set consisting of the end element and all preceding elements (we name this $vPreceding).

These are selected, respectively by the following XPath expressions:

$vFollowing:

$vStartNode | $vStartNode/following::*

$vPreceding:

$vEndNode | $vEndNode/preceding::*

Now we can simply apply the Kayessian formula on the nodesets $vFollowing and $vPreceding:

       $vFollowing
          [count(.|$vPreceding)
          =
           count($vPreceding)
          ]

What remains is to substitute all variables with their respective expressions.
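
For PHP readers, the substitution can be sketched with DOMXPath, which evaluates XPath 1.0. This is only an illustration under assumptions: the sample document is a shortened version of the one above, and the variable names ($start, $end, $following, $preceding) are made up for this sketch, since XPath variables cannot be bound through DOMXPath::query().

```php
<?php
// Shortened, well-formed sample document (an assumption for this sketch).
$xml = <<<EOT
<table class="target">
  <tr>
    <td>Other Node</td>
    <td>Starting Node</td>
    <td>Inner Node</td>
    <td>Inner Node</td>
    <td>Ending Node</td>
    <td>Other Node</td>
  </tr>
</table>
EOT;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xp = new DOMXPath($doc);

// The start and end elements, as plain XPath strings.
$start = "//table[@class = 'target']//td[. = 'Starting Node']";
$end   = "//table[@class = 'target']//td[. = 'Ending Node']";

// The two sets to intersect, built by textual substitution.
$following = "($start | $start/following::*)";
$preceding = "($end | $end/preceding::*)";

// Kayessian intersection: keep the nodes of $following that are also in $preceding.
$query = "{$following}[count(. | $preceding) = count($preceding)]";

$texts = [];
foreach ($xp->query($query) as $node) {
    $texts[] = $node->textContent;
}
echo implode(', ', $texts), "\n";
```

The braces in "{$following}[" matter: without them PHP would try to read the predicate as an array index on $following.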

XSLT - based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vStartNode" select=
 "//table[@class = 'target']//td[. = 'Starting Node']"/>

 <xsl:variable name="vEndNode" select=
 "//table[@class = 'target']//td[. = 'Ending Node']"/>

 <xsl:variable name="vFollowing" select=
 "$vStartNode | $vStartNode/following::*"/>

 <xsl:variable name="vPreceding" select=
 "$vEndNode | $vEndNode/preceding::*"/>

 <xsl:template match="/">
      <xsl:copy-of select=
          "$vFollowing
              [count(.|$vPreceding)
              =
               count($vPreceding)
              ]"/>
 </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the XML document above, the XPath expressions are evaluated and the wanted node-set is output:

<td>Starting Node</td>
<td>Inner Node</td>
<td>Inner Node</td>
<td>Inner Node</td>
<td>Ending Node</td>

I now provided the piece of source code that I want to display. Mind that I want to display all the html. — SuperSpy, Jan 21 '12 at 12:11
@SuperSpy: This isn't well-formed XML at all -- you need to clean it to make it well-formed XML. XPath operates on well-formed XML documents. — Dimitre Novatchev, Jan 21 '12 at 16:04
I can't format it, and it isn't XML. It is the source of another website. Take a look at my solution. (without DomDoc or XPath though..) — SuperSpy, Jan 22 '12 at 09:52

score 1 · Answer 2 · answered Jan 22 '12 at 15:02

Do not use regexes (or strpos...) to parse HTML!

Part of why this problem was difficult for you is you are thinking in "tags" instead of "nodes" or "elements". Tags are an artifact of serialization. (HTML has optional end tags.) Nodes are the actual data structure. A DOMDocument has no "tags", only "nodes" arranged in the proper tree structure.
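
A small sketch of the point about optional end tags, assuming PHP's DOMDocument with its libxml2 HTML parser. The markup is invented for the illustration: both </li> end tags are omitted, yet the parser should still produce two separate li elements.

```php
<?php
// Valid HTML with omitted </li> end tags (not well-formed XML).
$html = '<ul><li>one<li>two</ul>';

libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Each li is its own node, although the source never closed the first one.
$items = [];
foreach ($doc->getElementsByTagName('li') as $li) {
    $items[] = $li->textContent;
}
echo count($items), " li elements\n";
```

A search for the string '</li>' would find nothing here, which is why string functions are a poor fit for HTML.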

Here is how you get your table with XPath:

// This is a simple solution, but only works if the value of "class" attribute is exactly "schedule"
// $xpath = '//table[@class="schedule"]';

// This is what you want. It is equivalent to the "table.schedule" css selector:
$xpath = "//table[contains(concat(' ',normalize-space(@class),' '),' schedule ')]";

$d = new DOMDocument();
$d->loadHTMLFile('http://example.org');
$xp = new DOMXPath($d);
$tables = $xp->query($xpath);
foreach ($tables as $table) {
    // $table is a DOMElement for a table with class "schedule"; it includes all of its descendant nodes.
    echo $d->saveHTML($table); // serialize the element back to HTML, tags included
}
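
To get the table back as an HTML string (tags and nested tables included), one option is DOMDocument::saveHTML() with a node argument. The following is a minimal sketch, not a drop-in solution: the sample markup and the variable name $rawTable are assumptions, and the nested tables are placed inside cells so that the parser's error recovery does not come into play.

```php
<?php
// Invented sample page: one "schedule" table containing two nested tables.
$html = <<<EOT
<html><body>
<p>Before</p>
<table class="schedule wide">
  <tr><td><table class="nested"><tr><td>First inner</td></tr></table></td></tr>
  <tr><td><table class="nested"><tr><td>Second inner</td></tr></table></td></tr>
</table>
<p>After</p>
</body></html>
EOT;

libxml_use_internal_errors(true); // real-world HTML usually triggers warnings
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($doc);
// Matches "schedule" as one of several class names, like the CSS selector table.schedule.
$query = "//table[contains(concat(' ', normalize-space(@class), ' '), ' schedule ')]";
$table = $xp->query($query)->item(0);

// Serialize only the matched element; false if no such table exists.
$rawTable = $table !== null ? $doc->saveHTML($table) : false;
echo $rawTable, "\n";
```

Because the parser has already built the tree, there is no need to work out which closing tag belongs to the outer table.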

score 0 · Answer 3 · answered Jan 21 '12 at 04:40

If you have well-formed HTML like this:

<html>
<body>
    <table>
        <tr valign='top'>
            <td>
                <table class='inner'>
                    <tr><td>Inner Table</td></tr>
                </table>
            </td>
            <td>
                <table class='second inner'>
                    <tr><td>Second  Inner</td></tr>
                </table>
            </td>
        </tr>
    </table>
</body>
</html>

Output the nodes (in an XML wrapper) with this PHP code:

<?php
    $xml = new DOMDocument();
    $strFileName = "t.xml";
    $xml->load($strFileName);

    $xmlCopy = new DOMDocument();
    $xmlCopy->loadXML( "<xml/>" ); 

    $xpath = new DOMXPath( $xml );
    $strXPath = "//table[@class='inner']";

    $elements = $xpath->query( $strXPath, $xml );
    foreach( $elements as $element ) {
        $ndTemp = $xmlCopy->importNode( $element, true );
        $xmlCopy->documentElement->appendChild( $ndTemp );
    }
    echo $xmlCopy->saveXML();
?>

This doesn't seem to work. I've tried hard to make it work though... I've edited my post. Maybe you can help me better now. — SuperSpy, Jan 21 '12 at 12:10
@SuperSpy, I'm not sure what's not working, or what output you're expecting. The above example extracts an inner table wrapped in an outer table, isn't that what you're looking to do? — William Walseth, Jan 21 '12 at 17:20
I've updated my question and I've got a solution (though without XPath). — SuperSpy, Jan 22 '12 at 11:16

SuperSpy · Accepted Answer · 2014-08-10T20:16:05.310

This gets the whole table, but it can be modified to grab another tag. It is a quite case-specific solution which can only be used under specific circumstances. It breaks if HTML, PHP or CSS comments contain the opening or closing tag. Use it with caution.

Function:

// **********************************************************************************
// Gets a whole html tag with its contents.
//  - Source should be a well formatted html string (get it with file_get_contents or cURL)
//  - You CAN provide a custom startTag with in it e.g. an id or something else (<table style='border:0;')
//    This is recommended if it is not the only p/table/h2/etc. tag in the script.
//  - Ignores closing tags if there is an opening tag of the same sort you provided. Got it?
function getTagWithContents($source, $tag, $customStartTag = false)
{

    $startTag = '<'.$tag;
    $endTag   = '</'.$tag.'>';

    $startTagLength = strlen($startTag);
    $endTagLength   = strlen($endTag);

//      ***************************** 
    if ($customStartTag)
        $gotStartTag = strpos($source, $customStartTag);
    else
        $gotStartTag = strpos($source, $startTag);

    // Can't find it?
    if ($gotStartTag === false) // strict check: position 0 is a valid match
        return false;       
    else
    {

//      ***************************** 

        // This is the hard part: finding the correct closing tag position.
        // <table class="schedule">
        //     <table>
        //     </table> <-- Not this one
        // </table> <-- But this one

        $foundIt          = false;
        $locationInScript = $gotStartTag;
        $startPosition    = $gotStartTag;

        // Checks whether another opening tag appears before the next closing tag.
        while ($foundIt == false)
        {
            $gotAnotherStart = strpos($source, $startTag, $locationInScript + $startTagLength);
            $endPosition        = strpos($source, $endTag,   $locationInScript + $endTagLength);

            // If it can find another opening tag before the closing tag, skip that closing tag.
            if ($gotAnotherStart !== false && $gotAnotherStart < $endPosition)
            {               
                $locationInScript = $endPosition;
            }
            else
            {
                $foundIt  = true;
                $endPosition = $endPosition + $endTagLength;
            }
        }

//      ***************************** 

        // cut the piece from its source and return it.
        return substr($source, $startPosition, ($endPosition - $startPosition));

    } 
}

Application of the function:

$gotTable = getTagWithContents($tableData, 'table', '<table class="schedule"');
if (!$gotTable)
{
    $error = 'Failed to log in or to get the tag';
}
else
{
    //Do something you want to do with it, e.g. display it or clean it...
    // Non-greedy character classes, so each match stops at the attribute's own closing quote.
    $cleanTable = preg_replace('|href=\'[^\']*\'|', '', $gotTable);
    $cleanTable = preg_replace('|TITLE="[^"]*"|', '', $cleanTable);
}

Above you can find my final solution to my problem. Below is the old solution, from which I made a function for universal use.

Old solution:

// Try to find the table and remember its starting position. Check for success.
// No success means the user is not logged in.
$gotTableStart = strpos($source, '<table class="schedule"');
if ($gotTableStart === false)
{
    $err = 'Can\'t find the table start';
}
else
{

//      ***************************** 
    // This is the hard part: finding the closing tag.
    $foundIt          = false;
    $locationInScript = $gotTableStart;
    $tableStart       = $gotTableStart;

    while ($foundIt == false)
    {
        $innerTablePos = strpos($source, '<table', $locationInScript + 6);
        $tableEnd      = strpos($source, '</table>', $locationInScript + 7);

        // If it can find '<table' before '</table>' skip that closing tag.
        if ($innerTablePos !== false && $innerTablePos < $tableEnd)
        {               
            $locationInScript = $tableEnd;
        }
        else
        {
            $foundIt  = true;
            $tableEnd = $tableEnd + 8;
        }
    }

//      ***************************** 

    // Cut the table (including its tags) out of the source.
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));

}
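
The loops above skip one closing tag for each nested opening tag they meet, which handles nested tables that sit next to each other but not tables nested more than one level deep. A depth counter is one way to cover that case. The function below is a hypothetical variant written for illustration (getBalancedTag is not part of the original solution); it assumes $customStartTag, if given, begins with '<' followed by the tag name, and it still breaks on comments, optional end tags, or tag names that share a prefix.

```php
<?php
// Hypothetical helper: returns the element that starts at $customStartTag
// (or at the first "<$tag"), including same-named elements nested at any depth.
// Returns false if the start tag is missing or the tags are unbalanced.
function getBalancedTag(string $source, string $tag, ?string $customStartTag = null)
{
    $open  = '<' . $tag;
    $close = '</' . $tag . '>';

    $start = strpos($source, $customStartTag ?? $open);
    if ($start === false) {
        return false;
    }

    $depth  = 1;                       // we are inside the outer element
    $offset = $start + strlen($open);  // continue searching after its "<tag"
    while ($depth > 0) {
        $nextOpen  = strpos($source, $open, $offset);
        $nextClose = strpos($source, $close, $offset);
        if ($nextClose === false) {
            return false;              // ran out of closing tags
        }
        if ($nextOpen !== false && $nextOpen < $nextClose) {
            $depth++;                  // a nested element opens first
            $offset = $nextOpen + strlen($open);
        } else {
            $depth--;                  // a closing tag comes first
            $offset = $nextClose + strlen($close);
        }
    }
    return substr($source, $start, $offset - $start);
}

$sample = '<p>x</p><table class="schedule"><tr><td><table><tr><td><table></table>'
        . '</td></tr></table></td></tr></table><table>after</table>';
$result = getBalancedTag($sample, 'table', '<table class="schedule"');
echo $result, "\n";
```

This remains string manipulation, so it is only reasonable for markup whose shape you control or have inspected.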

You should not be using string manipulation on HTML. This will quickly fail in the face of optional end tags or bad markup. This is a little more clever than what people usually do, but it is still dangerous and unnecessary because `DOMDocument` will do all the hard parsing for you! — Francis Avila, Jan 22 '12 at 15:18
@FrancisAvila: Well, show me how to get the same result. Because I really can't. Not even after some tutorials. — SuperSpy, Jan 22 '12 at 15:27
@SuperSpy: You can convert the document to XML using some tool like XML-Tidy, or you can use a tool that allows XPath-like expressions to be evaluated on an HTML document -- like Html Agility Pack, or Chris Lovett's SGML reader (this one can also easily be used to convert the HTML into an XML document). — Dimitre Novatchev, Jan 22 '12 at 16:11
@SuperSpy I added an answer already, [see here](http://stackoverflow.com/a/8961992/1002469). @Dimitre, on PHP `DOMDocument` can parse HTML using `loadHTML*` methods (it uses underlying libxml2 html parser) and [html5lib](http://code.google.com/p/html5lib/) can produce a `DOMDocument` using an HTML5 parser. And once you have a `DOMDocument` you can issue XPath queries against it. Just FYI if you are less familiar with the PHP environment. — Francis Avila, Jan 22 '12 at 16:55
@FrancisAvila: Thank you, I don't know anything about PHP. Is the XPath support on the loaded HTML document fully XPath-compliant or are there deviations/limitations? — Dimitre Novatchev, Jan 22 '12 at 17:08
Yes, it is fully compliant XPath 1.0 with no limitations because an HTML document is parsed into the same `DOMDocument` data structure that an XML document would be. (Its former identity as an HTML string is forgotten.) The only troubles arise from the nature of invalid HTML not having a clearly defined set of rules for the resulting node tree. If you need a consistent cross-platform DOM tree for a given HTML string, the best approach is to use `html5lib` which uses well-defined HTML5 parsing rules. (Libxml2 does not as yet implement an html5 parser.) — Francis Avila, Jan 22 '12 at 17:16
