
I am trying to find the next-page link of a particular page (which I'll call the current page here). The current page my program uses is:

http://en.wikipedia.org/wiki/Category:1980_births

The next-page link that I extract from the current page is:

http://en.wikipedia.org/w/index.php?title=Category:1980_births&pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

But when file_get_contents() loads the next-page link, it returns the current page's contents.

The code is:

<?php

// Fetch the contents of the current page.
$string = file_get_contents("http://en.wikipedia.org/wiki/Category:1980_births");

// Extract the next-page link from the current page's contents.
preg_match_all("/\(previous page\) \(<a href=\"(.*)\" title/", $string, $matches);

foreach ($matches[1] as $match) {
    break;
}

$next_page_link = $match;

// The extracted link contains only the path, not the domain name,
// so the domain is prepended here. This has no bearing on the problem.
$next_page_link = "http://en.wikipedia.org" . $next_page_link;

$string1 = file_get_contents($next_page_link);
echo $next_page_link;
echo $string1;

?>

According to the code, $string1 should hold the next page's content, but instead it holds the current page's content.

Siva Kannan
  • Uhm... `foreach ($matches[1] as $match) break;`... Do you mean `$match = $matches[1][0];`?! – deceze May 25 '15 at 07:32
  • Yes, correct (I made it more complex than necessary); we could even remove the break statement. The current page has three next-page links in total (all pointing to the same page); I just take the first one and break out of the loop to avoid the next two iterations. – Siva Kannan May 25 '15 at 07:35
  • I don't see any differences on my local test!! – someOne May 25 '15 at 08:11
  • Could you be more specific? The program should load http://en.wikipedia.org/w/index.php?title=Category:1980_births&pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages, but http://en.wikipedia.org/wiki/Category:1980_births was loaded instead for me. – Siva Kannan May 25 '15 at 09:39

1 Answer


In the HTML source of the original page, the links have entity-encoded ampersands (see "Do I encode ampersands in <a href…>?"). A browser decodes them when you click the anchor, but your scraping code does not. Compare

http://en.wikipedia.org/ ... &amp;pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

versus

http://en.wikipedia.org ... &pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

The first form, with the literal &amp;, is the malformed query string you are in fact passing into file_get_contents. You can convert the entities back to regular ampersands like this:

// $next_page_link = $match; 
$next_page_link = html_entity_decode($match);
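
To see why the encoded link falls back to the first page, here is a minimal sketch that parses the query string before and after decoding. It makes no network request; the URL is the one from the question with the ampersand entity-encoded, as it would appear in the HTML source, and the variable names are only illustrative.

```php
<?php
// The link as it would appear in the HTML source: the ampersand is
// entity-encoded (assumed here, reconstructed from the question's URL).
$raw_link = 'http://en.wikipedia.org/w/index.php?title=Category:1980_births'
          . '&amp;pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages';

// Decode HTML entities, as a browser does before following the link.
$decoded_link = html_entity_decode($raw_link);

// Split each query string into its parameters.
parse_str(parse_url($raw_link, PHP_URL_QUERY), $raw_params);
parse_str(parse_url($decoded_link, PHP_URL_QUERY), $decoded_params);

// With PHP's default settings, the undecoded link has no parameter named
// "pagefrom", so the request carries only a recognisable "title".
var_dump(isset($raw_params['pagefrom']));

// After decoding, "pagefrom" is present as its own parameter.
var_dump(isset($decoded_params['pagefrom']));
```

Without a recognisable pagefrom parameter, the request is effectively just title=Category:1980_births, which is consistent with getting the first page back.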
Community
Drakes