I was trying something like this:

$url = "http://www.howtogeek.com";

$str = file_get_contents($url);

That gives me the whole page. The site in $url is only a placeholder, not the one I actually need; the website I'm trying to retrieve results from doesn't have an API that I could use. I want to extract the number of results, the titles of the results, and so on. Is there an easy way to do this?

Tom Tom
Hig
  • Are you asking this in the general sense of websites without APIs, or specifically about the howtogeek.com website? – Erwin Bolwidt Feb 22 '15 at 15:27
  • In the general sense. – Hig Feb 22 '15 at 15:36
  • Have a look at [Goutte](https://github.com/FriendsOfPHP/Goutte), which will allow you to read a page and then iterate over it using XPath or CSS expressions. Bear in mind the docs are a bit light, but it is good. – halfer Feb 22 '15 at 15:48

1 Answer

Yes, you need to use a DOM parser such as the DOMDocument class. Usage:

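$html = file_get_contents($url); // fetch the page source first
$doc = new DOMDocument();
$doc->loadHTML($html); // loadHTML() expects an HTML string, not a URL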

Then use the appropriate methods, such as getElementById(), getElementsByTagName(), or a DOMXPath query, to pull out the elements you want.

You could also do it with preg_match_all(), depending on exactly what you want to extract, but that can be next to impossible for a full web page, especially if you don't control the source yourself.
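
The DOM approach above can be sketched end to end as follows. This is only an illustration under assumptions: the inline HTML stands in for a page fetched with file_get_contents($url), and the resultCounter id and result-title class are hypothetical markup (the id is borrowed from the comments on this answer), not taken from any real site.

```php
<?php
// Hypothetical sample markup standing in for the fetched page.
// In practice this string would come from file_get_contents($url).
$html = <<<'HTML'
<!DOCTYPE html>
<html><body>
  <span id="resultCounter">2 results</span>
  <ul>
    <li><a class="result-title" href="/a">First result</a></li>
    <li><a class="result-title" href="/b">Second result</a></li>
  </ul>
</body></html>
HTML;

$doc = new DOMDocument();
$doc->loadHTML($html);

// DOMXPath lets us select nodes by attribute instead of walking the tree by hand.
$xpath = new DOMXPath($doc);

// Number of results: the first element whose id is "resultCounter" (assumed id).
$counterNode = $xpath->query('//*[@id="resultCounter"]')->item(0);
$resultCount = $counterNode !== null ? trim($counterNode->textContent) : null;

// Result titles: every <a> carrying the assumed "result-title" class.
$titles = [];
foreach ($xpath->query('//a[@class="result-title"]') as $link) {
    $titles[] = trim($link->textContent);
}

echo $resultCount, PHP_EOL;
print_r($titles);
```

For a real site, the two XPath expressions would have to be adapted to whatever ids, classes, or tag structure that site's markup actually uses.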

Community
Tom Tom
  • Yep, really don't use regular expressions to parse HTML. There is a [canonical answer on Stack Overflow](http://stackoverflow.com/a/1732454/472495) about that... – halfer Feb 22 '15 at 15:41
  • This is what I've tried so far: $doc = new DOMDocument(); $doc->loadHTML($url); $doc->preserveWhiteSpace = false; $resultCounter = $doc->getElementById('resultCounter'); echo $resultCounter->nodeValue; and I got "Trying to get property of non-object" on the last line. – Hig Feb 22 '15 at 18:51
  • You still need to get the HTML with $html = file_get_contents($url); beforehand. loadHTML() does not take a URL. – Tom Tom Feb 22 '15 at 19:39
  • $doc = new DOMDocument(); $html = file_get_contents($url); $doc->loadHTML($html); $doc->preserveWhiteSpace = false; $resultCounter = $doc->getElementById('resultCounter'); echo $resultCounter->nodeValue; That is what I've tried now, and I get "loadHTML(): htmlParseEntityRef: expecting ';' in Entity" on the line with loadHTML(). – Hig Feb 22 '15 at 20:18
  • See here http://stackoverflow.com/questions/1685277/warning-domdocumentloadhtml-htmlparseentityref-expecting-in-entity – Tom Tom Feb 22 '15 at 22:46
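
The two errors reported in the comments above can be sketched together. This is a minimal illustration with made-up markup containing unescaped ampersands, not the asker's real page: libxml_use_internal_errors(true) is one common way to keep loadHTML() from emitting parse warnings on sloppy real-world HTML, and checking the return value of getElementById() for null avoids the "non-object" error when the element is not found.

```php
<?php
// Hypothetical markup with bare "&" characters, the kind of input that
// typically triggers htmlParseEntityRef warnings.
$html = <<<'HTML'
<!DOCTYPE html>
<html><body>
  <span id="resultCounter">12 results</span>
  <a href="/search?q=php&page=2">Tom & Jerry</a>
</body></html>
HTML;

// Collect libxml parse errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Keep the messages for inspection, then reset libxml's error state.
$parseErrors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// getElementById() returns null when nothing matches, so check before
// reading a property from the result.
$counterNode = $doc->getElementById('resultCounter');
$resultCount = $counterNode !== null ? trim($counterNode->textContent) : null;

// An id that is not in the document (hypothetical name) yields null.
$missingNode = $doc->getElementById('noSuchId');

echo $resultCount ?? 'resultCounter not found', PHP_EOL;
echo $missingNode === null ? 'noSuchId not found' : $missingNode->textContent, PHP_EOL;
echo count($parseErrors), ' parse message(s) collected', PHP_EOL;
```

How many parse messages are collected depends on the libxml version, so the sketch prints the count rather than relying on it.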