2

Possible Duplicate:
How to parse HTML with PHP?

I want to write a PHP program that counts all the hyperlinks on a website whose address the user can enter.

How can I do this? Is there a library or something similar that I can use to parse the HTML and analyze its hyperlinks?

Thanks for your help.

Community
Sheldon Cooper
  • For a particular web page, you can count them using PHP and JavaScript/jQuery. See http://stackoverflow.com/questions/3184284/count-all-html-tags-in-page-php for reference. – Punit Sep 29 '11 at 15:01

3 Answers

1

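Like this:

<?php
// "someurl" is a placeholder for the address the user enters.
$site = file_get_contents("someurl");
if ($site === false) {
    exit("Could not fetch the page.");
}
// Counts literal "<a href=" occurrences only: case-sensitive, and it misses
// anchors that have another attribute before href.
$links = substr_count($site, "<a href=");
print "There are {$links} links on that page.";
?>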
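
If the plain substring count above turns out to be too loose, one possible alternative is to let PHP's bundled DOM extension parse the markup and count the anchors that actually carry an href attribute. This is only a sketch: `countLinks` is a made-up helper name, and it assumes the DOM extension is enabled and that the HTML has already been fetched into a string.

```php
<?php
// Hypothetical helper (not from the thread): counts <a> elements that have
// an href attribute in an HTML string.
function countLinks(string $html): int
{
    // loadHTML() rejects an empty string, so handle that case up front.
    if (trim($html) === '') {
        return 0;
    }

    $doc = new DOMDocument();

    // Real-world HTML is rarely valid; collect parser warnings internally
    // instead of printing them, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $count = 0;
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip named anchors such as <a name="top"> that link nowhere.
        if ($anchor->hasAttribute('href')) {
            $count++;
        }
    }

    return $count;
}
```

It could be called as `countLinks($site)` with the string returned by file_get_contents. Unlike substr_count, it also picks up upper-case tags and anchors where href is not the first attribute.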
Olli
  • This is the simple way; I think you could do it better with some of the alternatives. – Olli Sep 29 '11 at 14:58
  • Simple, but enough for my needs. Thanks. The approaches mentioned in the comments on my question are more powerful, and I will use them if this is not enough. – Sheldon Cooper Sep 29 '11 at 16:27
0

Well, we can't give you a definitive answer, only pointers. I once built a search engine in PHP, and the principle is the same:

  1. First of all, write your script as a console script. A web script is not really appropriate, although that is partly a matter of taste.
  2. You need to understand how to work with sockets in PHP and make requests; see the PHP network functions at: http://www.php.net/manual/ref.network.php
  3. You will need to become familiar with HTTP requests: learn how to make your own GET/POST requests and how to split the headers from the returned content.
  4. The last part is easy with a regular expression: just run preg_match_all on the content with "#()*#i" (that expression might be wrong; I didn't test it at all, OK?)
  5. Loop over the list of found hrefs, compare them with the hrefs already visited (remember to take wildcard GET parameters into account), and then repeat the process to load all the pages of the site.
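
Steps 4 and 5 above could be sketched roughly as follows. This is not the answerer's code: `extractHrefs` and `crawl` are hypothetical names, a DOM parser is swapped in for the untested regular expression, and the fetching step is passed in as a callable so the loop does not care whether sockets, file_get_contents or cURL are used. As a simplification it follows only absolute URLs on the start host, strips `#fragments`, and treats different query strings as different pages.

```php
<?php
// Hypothetical helper: returns the href values of all <a> elements.
function extractHrefs(string $html): array
{
    if (trim($html) === '') {
        return [];
    }
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // silence invalid-HTML warnings
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $hrefs = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        if ($anchor->hasAttribute('href')) {
            $hrefs[] = $anchor->getAttribute('href');
        }
    }
    return $hrefs;
}

// Hypothetical crawl loop. $fetch maps a URL to its HTML, or null on failure.
// Returns an array of "page URL => number of links found on that page".
function crawl(string $startUrl, callable $fetch, int $maxPages = 50): array
{
    $host = parse_url($startUrl, PHP_URL_HOST);
    $queue = [$startUrl];
    $visited = [];     // URLs already requested, used as a set
    $linkCounts = [];

    while ($queue !== [] && count($visited) < $maxPages) {
        $url = array_shift($queue);
        if (isset($visited[$url])) {
            continue; // already handled, do not request it twice
        }
        $visited[$url] = true;

        $html = $fetch($url);
        if ($html === null) {
            continue; // fetch failed, nothing to count
        }

        $hrefs = extractHrefs($html);
        $linkCounts[$url] = count($hrefs);

        foreach ($hrefs as $href) {
            // Assumption: only absolute links to the same host are followed.
            if (is_string($host) && parse_url($href, PHP_URL_HOST) === $host) {
                $queue[] = explode('#', $href, 2)[0]; // drop the fragment
            }
        }
    }
    return $linkCounts;
}
```

Resolving relative links and deciding which GET parameters really identify a distinct page are left out here; those are the parts that make a real crawler hard work.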

It IS HARD WORK... good luck

Mathieu Dumoulin
  • Note: I just saw the file_get_contents answer, and indeed you could replace the socket portion with file_get_contents and a $context, which will make your life easier in many respects. I'm old school, so I tend to use the old-school ways :) – Mathieu Dumoulin Sep 29 '11 at 15:02
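
The file_get_contents plus $context variant mentioned in the comment above might look something like this. It is a sketch under assumptions: `fetchPage` is a hypothetical name, the timeout and user-agent values are arbitrary, and allow_url_fopen must be enabled for remote URLs.

```php
<?php
// Hypothetical helper: fetches a URL (or local path) and returns the body,
// or null if the request failed.
function fetchPage(string $url, float $timeoutSeconds = 10.0): ?string
{
    // The "http" context options are also used for https:// URLs.
    $context = stream_context_create([
        'http' => [
            'method'          => 'GET',
            'timeout'         => $timeoutSeconds,           // read timeout in seconds
            'follow_location' => 1,                         // follow redirects
            'user_agent'      => 'LinkCounter/0.1 (example)', // made-up value
        ],
    ]);

    // @ suppresses the warning on failure; the false return is checked instead.
    $body = @file_get_contents($url, false, $context);

    return $body === false ? null : $body;
}
```

This avoids writing the request and splitting the headers by hand, since the stream wrapper returns only the response body.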
-2

You may have to use cURL to fetch the contents of the web page. Store them in a variable, then parse that for hyperlinks. You might need a regular expression for that.
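
A rough sketch of that idea, assuming the cURL extension is installed. `fetchWithCurl` and `countLinksWithRegex` are hypothetical names, the timeout is arbitrary, and the pattern is only an approximation: a regular expression cannot handle every way HTML can be written, so a real parser is usually the safer choice.

```php
<?php
// Hypothetical helper: downloads a page with cURL and returns the body,
// or null on failure.
function fetchWithCurl(string $url): ?string
{
    $handle = curl_init($url);
    if ($handle === false) {
        return null;
    }
    curl_setopt_array($handle, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // seconds, arbitrary choice
    ]);
    $body = curl_exec($handle);

    return is_string($body) ? $body : null;
}

// Hypothetical helper: counts opening <a ...> tags that contain an href
// attribute, case-insensitively.
function countLinksWithRegex(string $html): int
{
    // \b after "a" keeps tags such as <abbr> from matching;
    // [^>]* allows other attributes to appear before href.
    $matches = preg_match_all('/<a\b[^>]*\bhref\s*=/i', $html);

    return $matches === false ? 0 : $matches;
}
```

Usage would be along the lines of `$html = fetchWithCurl($url);` followed by `countLinksWithRegex($html)` once `$html` is known not to be null.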

TheTechGuy