file_get_contents() loads a different page than the browser

Question

I am trying to extract the next-page link from a particular page (I call that page the current page here). The current page my program uses is

http://en.wikipedia.org/wiki/Category:1980_births

The next-page link I extract from the current page is this one:

http://en.wikipedia.org/w/index.php?title=Category:1980_births&pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

But when file_get_contents() loads the next-page link, it returns the current page's contents.

The code is

<?php

// Fetch the contents of the current page.
$string = file_get_contents("http://en.wikipedia.org/wiki/Category:1980_births");
// Extract the next-page link from the current page's contents.
preg_match_all("/\(previous page\) \(<a href=\"(.*)\" title/", $string, $matches);

foreach ($matches[1] as $match) {
    break;
}

$next_page_link = $match;
// The extracted link holds only the path, not the domain name, so the domain is prepended here. This has no bearing on the problem.
$next_page_link = "http://en.wikipedia.org" . $next_page_link;

$string1 = file_get_contents($next_page_link);
echo $next_page_link;
echo $string1;

?>

According to the code, $string1 should hold the content of the next-page link, but instead it holds the current page's content.

Uhm... `foreach ($matches[1] as $match) break;`... Do you mean `$match = $matches[1][0];`?! - deceze, May 25 '15 at 07:32
Yes, correct (I made it more complex than needed); we can even remove the break statement. In total, the current page has three next-page links (all pointing to the same page). I just take the first one and break out of the loop to avoid the other two iterations. - Siva Kannan, May 25 '15 at 07:35
Could you be specific? The program should print http://en.wikipedia.org/w/index.php?title=Category:1980_births&pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages as the result, but http://en.wikipedia.org/wiki/Category:1980_births was loaded in the program for me. - Siva Kannan, May 25 '15 at 09:39

score 1 · Accepted Answer · edited May 23 '17 at 12:06

In the source of the original web site, the links have entity-encoded ampersands (See Do I encode ampersands in <a href…>?). The browser decodes them normally when you click the anchor, but your scraping code does not. Compare

http://en.wikipedia.org/ ... &amp;pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

versus

http://en.wikipedia.org ... &pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages

The first form, with the ampersand still entity-encoded, is the malformed query string you are in fact passing into file_get_contents(). You can convert the entities back to regular ampersands like this:

// $next_page_link = $match; 
$next_page_link = html_entity_decode($match);
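
To see why the encoded link falls back to the first page, it can help to look at which query parameters each variant actually carries. The following is only a minimal sketch: the `$match` value is an assumed example of what the regex captures from the page source (built from the URL in the question), and `$rawParams` / `$decodedParams` are hypothetical names used for illustration.

```php
<?php
// Assumed example of the href as captured from the page source:
// the ampersand is still entity-encoded as "&amp;".
$match = '/w/index.php?title=Category:1980_births&amp;pagefrom=Alexis%2C+Toya%0AToya+Alexis#mw-pages';

// Decode HTML entities so "&amp;" becomes a plain "&".
$decoded = html_entity_decode($match);

// Build both full URLs the same way the question does.
$rawUrl     = "http://en.wikipedia.org" . $match;
$decodedUrl = "http://en.wikipedia.org" . $decoded;

// Parse the query string of each URL into an array of parameters.
parse_str(parse_url($rawUrl, PHP_URL_QUERY), $rawParams);
parse_str(parse_url($decodedUrl, PHP_URL_QUERY), $decodedParams);

// Without decoding, the second parameter is named "amp;pagefrom".
echo implode(', ', array_keys($rawParams)), "\n";     // title, amp;pagefrom
// After decoding, the parameter is named "pagefrom" as intended.
echo implode(', ', array_keys($decodedParams)), "\n"; // title, pagefrom
```

In the undecoded variant there is no `pagefrom` parameter at all, which is consistent with the server returning the first page of the category.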
