I am successfully using the following code to retrieve external content from a table with a given class.

$url = 'https://www.anything.com';
$content = file_get_contents($url);
$first_step = explode( '<table class="main">' , $content );
$second_step = explode("</table>" , $first_step[1] );

echo $second_step[0];

Now I need the content of an <a class="link">content</a> element, but

$url = 'https://www.anything.com';
$content = file_get_contents($url);
$first_step = explode( '<a class="link">' , $content );
$second_step = explode("</a>" , $first_step[1] ); 

does not work.

Meanwhile I use this code

    // Create DOM from URL or file

    $sFilex = file_get_html("https://www.anything.com", False, $cxContext);

    // Find all links
    foreach($sFilex->find('a[class=link]') as $element) {
        echo $element->href . '<br>';
    }

to get all <a class="link">content</a> links successfully. But how can I limit this to the first found result only?

The actual markup of the link is

<a class="link" id="55834" href="/this/is/a/test">this is a test</a>

Thanks for your help!

Robin Alexander
  • You should probably use an HTML/XML parser, like [`DOMDocument`](http://php.net/manual/class.domdocument.php), for more reliable results. See [this question](https://stackoverflow.com/q/3577641) for more information about parsing HTML in PHP. – Decent Dabbler May 22 '17 at 12:02
  • Thanks, that helped me a lot! But how can I limit the results to the first one only? – Robin Alexander May 22 '17 at 13:03
  • I'm currently creating an answer for you, since I figured it could be a bit intimidating for the uninitiated. I'll add an example that answers your comment's question as well. – Decent Dabbler May 22 '17 at 13:04
  • Cool! Thank you very much! – Robin Alexander May 22 '17 at 13:09
  • I've added an answer for you, which I believe shouldn't have any errors. Let me know if it works for you and/or if you have any more questions about the answer. Also, if you approached it differently the first time around, tell me and I'll try to adjust my answer to your own approach. – Decent Dabbler May 22 '17 at 13:14
  • I didn't realize you had started using `simplehtmldom` already. I'm sorry to say that I don't have any experience with that parser. If you still want help with that, I'd recommend starting a new question about it, or perhaps add a tag for [tag:simple-html-dom]. – Decent Dabbler May 22 '17 at 13:41

1 Answer

Since I recommended using a proper HTML parser, which can be a bit intimidating for the uninitiated, I figured I could give you an example to start off with:

$url = 'https://www.anything.com';

// create a new DOMDocument (an XML/HTML parser)
$doc = new DOMDocument;
// this is used to repair possibly malformed HTML
$doc->recover = true;

// libxml is the parse library that DOMDocument internally uses
// put errors in a memory buffer, instead of outputting them immediately (basically ignore them, until you need them, if ever)
// the call returns the previous setting; keep it, so it can be restored at the end
$useInternalErrors = libxml_use_internal_errors( true );

// load the external URL; this might not work if retrieving external files is disabled.
// I will come back to that if it doesn't work for you.
$doc->loadHTMLFile( $url );

// xpath is a query language that allows you to query XML/HTML data structures.
// we create a DOMXPath instance that operates on the earlier created DOMDocument
$xpath = new DOMXPath( $doc );

// this is a query to get all <table class="main">
// note though, that it will also match <table class="test maintain">, etc.
// which might not be what you need
$tableMainQuery = '//table[contains(@class,"main")]';
/* explanation:
   //         match any descendant of the current context, in this case root
   table      match <table> elements
   []         with the predicate(s)
   contains() match a string, that contains some string, in this case:
   @class     the attribute 'class'
   'main'     containing the string main
*/   

// execute the query
// $nodes will be an instance of DOMNodeList (containing DOMNode instances)
$nodes = $xpath->query( $tableMainQuery );

// loop through all nodes
foreach( $nodes as $node ) {
  // echo the inner HTML content of the found node (or do something else with it)
  // (the getInnerHTML() helper function is defined below)
  // htmlentities() shows the markup as text in a browser; remove it to output the actual HTML
  echo htmlentities( getInnerHTML( $node ) );
}

// this is a query to get all <a class="link">
// similar comments and explanation apply as with previous query
$aLinkQuery = '//a[contains(@class,"link")]';

// execute the query
// $nodes will be an instance of DOMNodeList (containing DOMNode instances)
$nodes = $xpath->query( $aLinkQuery );

// loop through all nodes
foreach( $nodes as $node ) {
  // do something with the found nodes again
}

// clear any errors still left in memory
libxml_clear_errors();
// set previous state
libxml_use_internal_errors( $useInternalErrors );

// the helper function to get the inner HTML of a found node
function getInnerHTML( DOMNode $node ) {
  $html = '';
  foreach( $node->childNodes as $childNode ) {
    $html .= $childNode->ownerDocument->saveHTML( $childNode );
  }

  return $html;
}
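
The table query above matches any class attribute that merely contains the substring main. If only the whole class name should match, a common XPath 1.0 idiom is to pad the attribute with spaces and search for the padded name. The following is a minimal sketch of that idea, not part of the original example: the markup in $html is made up and stands in for the fetched page, and the names $exactQuery, $looseCount and $exactCount are only illustrative.

```php
<?php
// Hypothetical sample markup standing in for the fetched page
$html = '<table class="main"><tr><td>wanted</td></tr></table>'
      . '<table class="test maintain"><tr><td>unwanted</td></tr></table>';

$doc = new DOMDocument;
$previous = libxml_use_internal_errors( true );
$doc->loadHTML( $html );
libxml_clear_errors();
libxml_use_internal_errors( $previous );

$xpath = new DOMXPath( $doc );

// pad the (whitespace-normalized) class attribute with spaces,
// then look for the class name padded the same way: " main "
$exactQuery = '//table[contains(concat(" ", normalize-space(@class), " "), " main ")]';

// substring match: also counts class="test maintain"
$looseCount = $xpath->query( '//table[contains(@class,"main")]' )->length;
// whole-token match: only counts elements that really have the class "main"
$exactCount = $xpath->query( $exactQuery )->length;

echo "loose: $looseCount, exact: $exactCount\n";
```

Unlike `@class="main"`, this form should keep matching when the element later receives additional classes.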

Now, to get only the first found node of an xpath query (a DOMNodeList instance), I think the simplest would be:

// in both the examples below $node will contain the element you are looking for
// $nodes will keep being a list of all found nodes

if( $nodes->length > 0 ) {
  $node = $nodes->item( 0 );
  // do something with the $node
}

// or, perhaps
if( null !== ( $node = $nodes->item( 0 ) ) ) {
  // do something with the $node
}

You could also adjust the xpath query to only find the first matching node, but I believe it would then still return a DOMNodeList.
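
As a rough sketch of that last option, assuming made-up sample markup in $html instead of the real page (the first link mirrors the one from the question), the query can be wrapped in parentheses and given a `[1]` predicate. The result is still a DOMNodeList, so `item( 0 )` is needed either way. The variable names $first, $href, $text and $outer are only illustrative.

```php
<?php
// Hypothetical sample markup; the first link mirrors the one in the question
$html = '<p><a class="link" id="55834" href="/this/is/a/test">this is a test</a></p>'
      . '<p><a class="link" href="/another">another</a></p>';

$doc = new DOMDocument;
$previous = libxml_use_internal_errors( true );
$doc->loadHTML( $html );
libxml_clear_errors();
libxml_use_internal_errors( $previous );

$xpath = new DOMXPath( $doc );

// the parentheses matter: (...)[1] picks the first node of the whole result set,
// whereas //a[...][1] would pick every <a> that is the first match within its own parent
$nodes = $xpath->query( '(//a[contains(@class,"link")])[1]' );

// the result is still a DOMNodeList, so take the first (and only) item
$first = $nodes->item( 0 );
if ( $first instanceof DOMElement ) {
  $href  = $first->getAttribute( 'href' );  // value of the href attribute
  $text  = $first->textContent;             // text between <a> and </a>
  $outer = $doc->saveHTML( $first );        // outer HTML, including id, class and href
  echo $href, "\n", $text, "\n", $outer, "\n";
}
```

Whether this is preferable to `$nodes->item( 0 )` on the unrestricted query is mostly a matter of taste; the restricted query simply avoids collecting nodes that are never used.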

Decent Dabbler
  • Wow! Thanks so much for your help and time! It gets everything within the `content`, but using the `if( $nodes->len...` code lists all results as before. And: any chance to get the pure html from that `content` (including id, style, href,etc...)? – Robin Alexander May 22 '17 at 13:24
  • You're welcome @vloryan! I edited my answer in the meantime, which you may have missed. Let me know if it answered your "get pure HTML" question. If you want the complete HTML of a `$node` (the outer HTML, not just the inner HTML), you can simply do: `$node->ownerDocument->saveHTML( $node )`; – Decent Dabbler May 22 '17 at 13:30
  • And about the `if( $nodes->len...`: make sure you are not looping through the `$nodes` with `foreach()`, i.e.: replace the `foreach()` loops with the `if( $nodes->len...` examples. – Decent Dabbler May 22 '17 at 13:31
  • Nice, nice, thanks! But I still got all, not only the first result. I have tried to replace the `// loop through all nodes...` lines with your alternate `if( $nodes->length > 0 ) { $node = $nodes->item( 0 );}`, but there were no results at all. Sorry to bother you! – Robin Alexander May 22 '17 at 13:47
  • Are you sure `$nodes->length` is larger than `0` for the relevant query? If it is `0`, then `foreach(...` shouldn't work either. Try a `var_dump( $nodes->length )` to see what you get. Are you not mixing up the two separate queries here (the `<table class="main">` one and the `<a class="link">` one)? – Decent Dabbler May 22 '17 at 13:53
  • OK, I have two problems left. The result for the `a class` does not show up as pure HTML, though I use `echo htmlentities( getInnerHTML( $node ) );`. And when I replace the foreach with the other code, I still get no result at all (https://codeshare.io/5vKmQl). `var_dump( $nodes->length )` gives me `5` results. – Robin Alexander May 22 '17 at 17:22
  • @vloryan 1. Remove the surrounding `htmlentities()` if you need the original HTML; `htmlentities()` simply allows you to view HTML in a browser as text. 2. You need to use the `$node` after `$node = $nodes->item( 0 );`. In other words: `$nodes` will keep containing **all** nodes found, `$node` will contain the **first** node found. – Decent Dabbler May 22 '17 at 17:34
  • Any chance of a more precise alternative to `'//a[contains(@class,"link")]';`? I guess I have fetched other classes containing **link** :) – Robin Alexander May 22 '17 at 19:55
  • @vloryan That depends on how precise you can define it. :-) What's the most precise properties you can find about the link you want? (Without the actual `href` of course, because then there's no need to code it :-)) – Decent Dabbler May 22 '17 at 20:08
  • I got `'//a[contains(@class,"games_tools")]'`, which also fetches the `play_tools` link. An exact match for `games_tools` would be nice! – Robin Alexander May 22 '17 at 20:20
  • @vloryan Hmmm, that should not happen, unless the `play_tools` one also has a class named `games_tools`. In which case it's gonna get tricky. Then you'd need to find a property that uniquely identifies the link you need. – Decent Dabbler May 22 '17 at 20:25
  • Any chance to define exact matches? – Robin Alexander May 22 '17 at 20:26
  • @vloryan Yes, that would be `//a[@class="games_tools"]`, but as soon as another class is added to that link, you are in trouble. Could you edit your question and give a small sample of the actual HTML code you are parsing? – Decent Dabbler May 22 '17 at 20:28
  • My fault, `play_tools` was placed somewhere else, that's why I got a wrong result. Any chance to apply `//a[contains(@class,"tooltip")]` only after this text (which occurs only once per page): `Search results: Clubs`? – Robin Alexander May 22 '17 at 20:54
  • @vloryan Sorry, I was/am a bit tied up at the moment, but I saw you already asked a new question. I was going to suggest that as well. :-) Glad you found a solution! – Decent Dabbler May 23 '17 at 10:05
  • Hi, me again. Using this in a PHP `foreach` loop results in the error `Fatal error: Cannot redeclare getInnerHTML() (previously declared in...` **after the first loop** (which works fine). Can you help me solve this? Would be great! Thanks. – Robin Alexander May 26 '17 at 12:58
  • @vloryan Yes, it's very simple: you cannot redefine a function in PHP, so you must put the definition of the function somewhere where you are sure it will only be defined once. So: outside the `foreach()` loop (perhaps at the top of your file, for instance). – Decent Dabbler May 27 '17 at 08:49
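
The advice in the last comment can be sketched as follows: the helper is declared once at the top level, and only the parsing happens inside the loop. This is a minimal illustration under assumptions; the $pages array holds made-up snippets standing in for whatever is fetched on each iteration, and $results is a hypothetical name for the collected output.

```php
<?php
// defined once, at the top level of the file, outside any loop
function getInnerHTML( DOMNode $node ) {
  $html = '';
  foreach( $node->childNodes as $childNode ) {
    $html .= $childNode->ownerDocument->saveHTML( $childNode );
  }

  return $html;
}

// Hypothetical snippets standing in for the pages fetched on each iteration
$pages = array(
  '<table class="main"><tr><td>one</td></tr></table>',
  '<table class="main"><tr><td>two</td></tr></table>',
);

$results = array();
$previous = libxml_use_internal_errors( true );

foreach( $pages as $page ) {
  // a fresh parser per page; the helper above is only called here, never redeclared
  $doc = new DOMDocument;
  $doc->loadHTML( $page );
  $xpath = new DOMXPath( $doc );

  // first matching table only
  $node = $xpath->query( '//table[contains(@class,"main")]' )->item( 0 );
  if( null !== $node ) {
    $results[] = getInnerHTML( $node );
  }
}

libxml_clear_errors();
libxml_use_internal_errors( $previous );

print_r( $results );
```

If the loop body lives in a separate file that is pulled in with `include` on every iteration, the same error can appear; in that case moving the function to a file loaded with `require_once`, or wrapping the definition in an `if ( !function_exists( 'getInnerHTML' ) )` check, is one possible way around it.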