Since I recommended using a proper HTML parser, which can be a bit intimidating for the uninitiated, I figured I would give you an example to start off with:
$url = 'https://www.anything.com';
// create a new DOMDocument (an XML/HTML parser)
$doc = new DOMDocument;
// this is used to repair possibly malformed HTML
$doc->recover = true;
// libxml is the parse library that DOMDocument internally uses
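// put errors in a memory buffer instead of outputting them immediately (basically ignore them until you need them, if ever)
// the return value is the previous setting, which we keep so it can be restored at the end
$useInternalErrors = libxml_use_internal_errors( true );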
// load the external URL; this might not work if retrieving external files is disabled.
// I will come back to that if it doesn't work for you.
$doc->loadHTMLFile( $url );
// xpath is a query language that allows you to query XML/HTML data structures.
// we create a DOMXPath instance that operates on the DOMDocument created earlier
$xpath = new DOMXPath( $doc );
// this is a query to get all <table class="main">
// note though, that it will also match <table class="test maintain">, etc.
// which might not be what you need
$tableMainQuery = '//table[contains(@class,"main")]';
/* explanation:
//          match any descendant of the current context, in this case root
table       match <table> elements
[]          with the predicate(s)
contains()  match a string that contains some string, in this case:
@class      the attribute 'class'
"main"      containing the string main
*/
// execute the query
// $nodes will be an instance of DOMNodeList (containing DOMNode instances)
$nodes = $xpath->query( $tableMainQuery );
// loop through all nodes
foreach( $nodes as $node ) {
    // echo the inner HTML content of the found node (or do something else with it)
    // (the getInnerHTML() helper function is defined below)
    // htmlentities() escapes the markup for display; remove the call to output the actual HTML
    echo htmlentities( getInnerHTML( $node ) );
}
// this is a query to get all <a class="link">
// the same comments and explanation apply as for the previous query
$aLinkQuery = '//a[contains(@class,"link")]';
// execute the query
// $nodes will be an instance of DOMNodeList (containing DOMNode instances)
$nodes = $xpath->query( $aLinkQuery );
// loop through all nodes
foreach( $nodes as $node ) {
    // do something with the found nodes again
}
// clear any errors still left in memory
libxml_clear_errors();
// restore the previous state
libxml_use_internal_errors( $useInternalErrors );
// the helper function to get the inner HTML of a found node
function getInnerHTML( DOMNode $node ) {
    $html = '';
    foreach( $node->childNodes as $childNode ) {
        $html .= $childNode->ownerDocument->saveHTML( $childNode );
    }
    return $html;
}
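The caveat in the comments above (that `contains(@class,"main")` also matches `class="test maintain"`) can be avoided with a stricter predicate. Here is a minimal sketch, assuming the markup is already available as a string; the `$html` sample and the `findByClass()` helper are made up for illustration and are not part of DOMDocument:

```php
<?php
// hypothetical helper: find <$tag> elements whose class attribute contains
// $class as a whole, space-separated token
// (assumes $tag and $class are trusted values without quote characters)
function findByClass( DOMXPath $xpath, $tag, $class ) {
    // pad the whitespace-normalized class attribute with spaces, then look
    // for " main " so that "maintain" no longer matches
    $query = sprintf(
        '//%s[contains(concat(" ", normalize-space(@class), " "), " %s ")]',
        $tag,
        $class
    );
    return $xpath->query( $query );
}

// made-up sample markup
$html = '<html><body>'
      . '<table class="main"><tr><td>a</td></tr></table>'
      . '<table class="test maintain"><tr><td>b</td></tr></table>'
      . '<table class="wide main"><tr><td>c</td></tr></table>'
      . '</body></html>';

$previous = libxml_use_internal_errors( true );
$doc = new DOMDocument;
$doc->loadHTML( $html );
$xpath = new DOMXPath( $doc );

// the substring query from the example above, for comparison
$loose  = $xpath->query( '//table[contains(@class,"main")]' );
// the whole-token query
$strict = findByClass( $xpath, 'table', 'main' );

echo $loose->length, ' loose, ', $strict->length, " strict\n";

libxml_clear_errors();
libxml_use_internal_errors( $previous );
```

With this sample the substring query finds all three tables, while the token query skips the `test maintain` one.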
Now, to get only the first node found by an xpath query (which returns a DOMNodeList instance), I think the simplest approach would be:
// in both examples below, $node will contain the element you are looking for
// $nodes remains the list of all found nodes
if( $nodes->length > 0 ) {
    $node = $nodes->item( 0 );
    // do something with the $node
}
// or, perhaps
if( null !== ( $node = $nodes->item( 0 ) ) ) {
    // do something with the $node
}
You could also adjust the xpath query to match only the first node, but I believe it would then still return a DOMNodeList.
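To illustrate that last point, here is a small sketch, assuming a made-up `$html` string, in which the query itself is limited to the first match. `query()` still hands back a DOMNodeList, just one with at most a single item, so you still need `item( 0 )`:

```php
<?php
// made-up sample markup
$html = '<html><body><p><a class="link" href="/one">one</a></p>'
      . '<p><a class="link" href="/two">two</a></p></body></html>';

$previous = libxml_use_internal_errors( true );
$doc = new DOMDocument;
$doc->loadHTML( $html );
$xpath = new DOMXPath( $doc );

// the parentheses matter: (//a[...])[1] is the first match in the whole document,
// whereas //a[...][1] would match every <a> that is the first match within its own parent
$firstLinkQuery = '(//a[contains(@class,"link")])[1]';
$nodes = $xpath->query( $firstLinkQuery );

// the result is still a list, not a single node
$isList = $nodes instanceof DOMNodeList;

$node = $nodes->item( 0 );
if( null !== $node ) {
    echo $node->getAttribute( 'href' ), "\n";
}

libxml_clear_errors();
libxml_use_internal_errors( $previous );
```

Whether this is nicer than taking `item( 0 )` from the full list is mostly a matter of taste; it mainly helps when you never need the other matches.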