
I am trying to write a small PHP function that goes through a page (given a URL) and returns the total number of links on it and the number of links that point back to the same page. For example, if I provide google.com as the URL, it should return how many links there are on google.com and how many of them link back to google.com (including, of course, www.google.com, google.com, google.com/#, etc.).

Is that easy to do, and how would I do it?

(This is NOT a homework question, so please provide as much help as possible.)

If you need more information about what I mean by the question, just ask me and I will provide more information.

user220755

4 Answers


I'd suggest SimpleXml or DOM for this task, but both will choke on invalid markup, and unfortunately the majority of the web still uses invalid markup, including the Google page you mentioned in your question. You could fetch the HTML from these URLs and tidy it first, but you can also use SimpleHTMLDom:

require_once 'simple_html_dom.php'; // provides file_get_html() and the find() selector

$links = array('inbound'  => array(),
               'outbound' => array());

$url  = 'http://www.example.com';
$host = parse_url($url, PHP_URL_HOST);
$html = file_get_html($url);

foreach ($html->find('a') as $link) {
    $linkHost = parse_url($link->href, PHP_URL_HOST);
    // Relative links have no host part, so treat them as pointing back to the same site
    $type = ($linkHost === $host || $linkHost === null) ? 'inbound' : 'outbound';
    $links[$type][] = $link->href;
}

print_r($links);

Please note that I do not have SimpleHTMLDom installed atm, so the above might not work out of the box. It should point you in the right direction though.


EDIT

Oh boy, did I really write this? Was I drunk or something? And why did no one complain about it? To correct myself:

DOM handles broken HTML fine if you use the loadHTML() method. SimpleXml doesn't. The suggested solution with SimpleHTMLDom will probably work, but IMO SimpleHTMLDom sucks. Better third-party libraries can be found in Best Methods to parse HTML.
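
For illustration, here is a minimal sketch of the DOM approach, assuming the same inbound/outbound split as above (the example.com URL is a placeholder; libxml_use_internal_errors() just silences the warnings loadHTML() emits on sloppy markup):

$url  = 'http://www.example.com';
$host = parse_url($url, PHP_URL_HOST);
$links = array('inbound' => array(), 'outbound' => array());

// loadHTML() copes with broken markup; suppress the parser warnings it raises
libxml_use_internal_errors(true);
$dom = new DOMDocument();
$dom->loadHTML(file_get_contents($url));
libxml_clear_errors();

foreach ($dom->getElementsByTagName('a') as $anchor) {
    $href     = $anchor->getAttribute('href');
    $linkHost = parse_url($href, PHP_URL_HOST);
    // No host part means a relative link, i.e. it stays on the same site
    $type = ($linkHost === $host || $linkHost === null) ? 'inbound' : 'outbound';
    $links[$type][] = $href;
}

printf("%d inbound, %d outbound\n", count($links['inbound']), count($links['outbound']));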

Gordon
  • I did install SimpleHTML and implemented the code, but it still did not work. Is there a reason why that is happening? – user220755 Jan 27 '10 at 22:38
  • Well, like I said below the code, it *might not work out of the box*, but I won't be able to help if you don't tell me what exactly is not working. Does it show any errors? Did you change the $url? Does it find any links at all? Will there be any links in $links? More details please :) – Gordon Jan 27 '10 at 23:06

Load the page content into a variable:

$html = file_get_contents("http://www.somesite.com"); 

and run preg_match() on $html.

Check the PHP manual for that function:

http://www.php.net/manual/en/function.preg-match.php
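
As a rough sketch of what that could look like (the pattern is deliberately naive and somesite.com is just a placeholder; it will only catch plain double-quoted href attributes):

$url  = "http://www.somesite.com";
$host = parse_url($url, PHP_URL_HOST);
$html = file_get_contents($url);

// Capture everything inside href="..." with a naive pattern
preg_match_all('/href="([^"]+)"/i', $html, $matches);

$total = count($matches[1]);
$same  = 0;
foreach ($matches[1] as $href) {
    $linkHost = parse_url($href, PHP_URL_HOST);
    if ($linkHost === $host || $linkHost === null) {
        $same++; // relative links count as linking back to the same site
    }
}

echo "$total links total, $same of them link back to $host\n";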

Dr Casper Black
  • While this will work, Regex is not the right tool for *parsing* HTML. Regex is for string pattern matching, not DOM traversal. – Gordon Jan 26 '10 at 17:58

http://php.net/manual/en/book.simplexml.php

You can use SimpleXML to find all the links in a page, and then check the resulting links with preg_match() to see if they match what you're searching for.
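
A rough sketch of that approach (example.com and the host pattern are placeholders; keep in mind that SimpleXML will only cope with well-formed XHTML):

// SimpleXML only parses well-formed XML/XHTML
$xml = simplexml_load_file('http://www.example.com');

$total = 0;
$same  = 0;
foreach ($xml->xpath('//a[@href]') as $anchor) {
    $href = (string) $anchor['href'];
    $total++;
    // Count links that match the target host, plus relative links (no host part)
    if (preg_match('#^(https?://)?(www\.)?example\.com#i', $href)
        || parse_url($href, PHP_URL_HOST) === null) {
        $same++;
    }
}

echo "$total links, $same of them point back to the page\n";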

Erik

Combination of a regular expression and a hash, I'd say. My PHP sucks, but it'd be something like this in Perl:

my %Counter;
while (my $currentLine = <inData>) {
    if ($currentLine =~ /(www\.[^\/\s"']+)/) {
        $Counter{$1}++;
    }
}

foreach my $thingy (keys %Counter) {
    print "There are $Counter{$thingy} links to $thingy in this document\n";
}
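
A rough PHP equivalent of the same idea might look like this (page.html is a placeholder for wherever the fetched markup is stored):

$counter = array();
foreach (file('page.html') as $currentLine) {
    if (preg_match('/(www\.[^\/\s"\']+)/', $currentLine, $m)) {
        $counter[$m[1]] = isset($counter[$m[1]]) ? $counter[$m[1]] + 1 : 1;
    }
}

foreach ($counter as $host => $count) {
    echo "There are $count links to $host in this document\n";
}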
dangerstat
  • And if there's a plain text URL in the page? He wants links. You do not parse HTML with regular expressions. – Erik Jan 26 '10 at 19:47