I am working on a simple PHP page which does this:

  1. Takes a search string from the URL query string (e.g. police officer)
  2. Appends the search string to a Wikipedia search URL (`https://en.wikipedia.org/w/index.php?search=police+officer`)
  3. Uses cURL to get the final redirected URL for that search string
  4. Checks whether the redirected URL contains index.php?search; if it does, does nothing
  5. Otherwise, explodes the redirected URL and takes the last segment (Police_officer)
  6. Appends that value to the Wikipedia URL that returns JSON data for that page (https://en.wikipedia.org/api/rest_v1/page/summary/Police_officer)
  7. Uses file_get_contents() to read the JSON data and pull fields out of it, e.g. the title
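
The URL-handling steps above (2, 4, 5 and 6) can be sketched as small pure functions, which makes them easy to check without any network access. This is only a sketch: the function names are hypothetical and not part of the page in question, and it assumes the title is the last path segment of the redirected URL.

```php
<?php
// Hypothetical helper names, chosen for illustration only.

// Step 2: build the search URL; urlencode() turns spaces into "+"
// and also escapes other special characters.
function build_search_url(string $term): string {
    return 'https://en.wikipedia.org/w/index.php?search=' . urlencode($term);
}

// Steps 4-5: return the last path segment of the redirected URL,
// or null if the URL is still a search-results URL.
function title_from_redirect(string $url): ?string {
    if (strpos($url, 'index.php?search') !== false) {
        return null; // no direct article match
    }
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    $parts = explode('/', $path);
    $last = end($parts);
    return $last === '' ? null : $last;
}

// Step 6: build the REST summary URL for a title taken from a URL path
// (assumed to be already URL-safe, so it is appended unchanged).
function summary_url(string $title): string {
    return 'https://en.wikipedia.org/api/rest_v1/page/summary/' . $title;
}
```

For example, `title_from_redirect('https://en.wikipedia.org/wiki/Police_officer')` gives `Police_officer`, which `summary_url()` turns into the summary URL shown in step 6.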

For some reason, on this line of code:

$json = file_get_contents($url_json);

where $url_json is:

https://en.wikipedia.org/api/rest_v1/page/summary/Santa_claus

I get this error:

Warning: file_get_contents(https://en.wikipedia.org/api/rest_v1/page/summary/Santa_claus): failed to open stream: HTTP request failed! HTTP/1.1 404 Not Found in C:\xampp\public_html\test.php on line 49

Yet I can open that URL in the browser and see the same kind of data as I get for this URL:

https://en.wikipedia.org/api/rest_v1/page/summary/Police_officer

And for that one, file_get_contents returns the data just fine.

I used this code:

function get_http_response_code($url) {
    $headers = get_headers($url);
    return substr($headers[0], 9, 3);
}

That confirmed that the response code for both pages is 200.
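
One caveat with this kind of check: get_headers() follows redirects by default and returns the headers of every response in the chain, so `$headers[0]` is the status line of the first response, not necessarily the final one. The sketch below picks the last status line instead. The function name is hypothetical, and it assumes the default (non-associative) array format that get_headers() returns.

```php
<?php
// Hypothetical helper: return the status code of the LAST response
// found in a get_headers()-style list, or null if none is present.
function final_status_code(array $headers): ?int {
    $code = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 404 Not Found" or "HTTP/2 200".
        if (is_string($line) && preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1]; // keep overwriting so the last one wins
        }
    }
    return $code;
}
```

It would be called as `final_status_code(get_headers($url_json))`. Note that the test code below checks `$redirected_url`, not `$url_json`, so it may be worth running the check against the JSON URL as well.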

This is my basic test code:

$var = $_GET['var'];
$var = str_replace(" ", "+", $var);

$url1 = "https://en.wikipedia.org/w/index.php?search=$var";

echo "<hr /> url1: $url1 <hr />";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch);
$redirected_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

echo "<hr /> url2: $redirected_url <hr />";

$url_search = strpos($redirected_url, "index.php?search");

echo "<hr /> url_search: $url_search <hr />";

function get_http_response_code($url) {
    $headers = get_headers($url);
    return substr($headers[0], 9, 3);
}

$url_response = get_http_response_code($redirected_url);

echo "<hr /> url_response: $url_response <hr />";

if ($url_search !== false) { // strpos() returns false when there is no match

    // do nothing

} else {

    $tmp = explode('/', $redirected_url);
    $end = end($tmp);

    $url_json = "https://en.wikipedia.org/api/rest_v1/page/summary/$end";

    echo "<hr /> url_json: $url_json <hr />";

    $json = file_get_contents($url_json);

    if ($json) {

        $data = json_decode($json, TRUE);

        if ($data) {
            $wiki_page = $data['content_urls']['desktop']['page'];
            echo "<hr /> wiki_page: $wiki_page <hr />";
        }

    }

}

What have I missed?


1 Answer

This was fixed once I used cURL instead of file_get_contents():

$var = $_GET['var'];
$var = str_replace(" ", "+", $var);

$url1 = "https://en.wikipedia.org/w/index.php?search=$var";

echo "<hr /> url1: $url1 <hr />";

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url1);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$a = curl_exec($ch);
$redirected_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

echo "<hr /> url2: $redirected_url <hr />";

$url_search = strpos($redirected_url, "index.php?search");

echo "<hr /> url_search: $url_search <hr />";

function get_http_response_code($url) {
    $headers = get_headers($url);
    return substr($headers[0], 9, 3);
}

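// cURL-based replacement for file_get_contents(); returns the body or false.
function file_get_contents_curl($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_HEADER, false);        // body only, no headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return instead of printing
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 2);    // 2 is the valid strict value (3 is not)
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, true); // keep certificate checks on
    $html = curl_exec($ch);
    curl_close($ch);
    return $html;
}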

$url_response = get_http_response_code($redirected_url);

echo "<hr /> url_response: $url_response <hr />";

if ($url_search !== false) { // strpos() returns false when there is no match

    // do nothing

} else {

    $tmp = explode('/', $redirected_url);
    $end = end($tmp);

    $url_json = "https://en.wikipedia.org/api/rest_v1/page/summary/$end";

    echo "<hr /> url_json: $url_json <hr />";

    //$json = file_get_contents($url_json);

    $json = file_get_contents_curl($url_json);

    echo "<hr /> json: $json <hr />";

    if ($json) {

        $data = json_decode($json, TRUE);

        // $data is an array, so dump it rather than interpolating it
        echo "<hr /> data: " . print_r($data, true) . " <hr />";

        if ($data) {
            $wiki_page = $data['content_urls']['desktop']['page'];
            echo "<hr /> wiki_page: $wiki_page <hr />";
        }

    }

}
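
The decoding step at the end can also be made a bit more defensive, since json_decode() returns null on invalid input and the nested keys may be absent. A minimal sketch, assuming the response has the `content_urls.desktop.page` shape read above; the function name is hypothetical:

```php
<?php
// Hypothetical helper: pull content_urls.desktop.page out of a summary
// response body, returning null if the JSON is invalid or the key is missing.
function extract_desktop_page_url(string $json): ?string {
    $data = json_decode($json, true);
    if (!is_array($data)) {
        return null; // invalid JSON or not an object/array
    }
    $page = $data['content_urls']['desktop']['page'] ?? null;
    return is_string($page) ? $page : null;
}
```

With this, the block above reduces to calling `extract_desktop_page_url($json)` and echoing the result only when it is not null.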