I am trying to grab the metadata from a news article on the NY Times website, specifically http://www.nytimes.com/2014/06/25/us/politics/thad-cochran-chris-mcdaniel-mississippi-senate-primary.html

Whenever I try, however, I get redirected by the site because my "browser" does not accept cookies. I have enabled the cURL options to save cookies and tried the accepted answers to a few other Stack Overflow questions. Those answers worked on the websites in question, but they don't seem to work on the nytimes site.

My current PHP cURL function looks like this:

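function get_extra_meta_tags_curl($url) {
    // Temporary file that holds the cookies shared by both requests
    $ckfile = tempnam("/public_html/commentarium/", "cookies.txt");

    // Assumption: $main_url (never defined in the original) is the site's front page
    $parts = parse_url($url);
    $main_url = $parts['scheme'] . '://' . $parts['host'] . '/';

    // First request: visit the front page so the site can set its cookies
    $ch = curl_init($main_url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $output = curl_exec($ch);
    curl_close($ch); // release the handle so the cookies are written to $ckfile

    // Second request: fetch the article, sending the saved cookies
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
    curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
    $output = curl_exec($ch);
    curl_close($ch);

    echo $output;
}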

The problem appears to be that when I request the URL, nytimes.com checks whether the browser accepts cookies. It checks a couple of times before redirecting to the login page with a REFUSE_COOKIE_ERROR. Instead of posting the full redirect list here, I have put it on my test page, along with the raw HTML that the final redirect returns and what my current get_extra_meta_tags_curl function returns (under "CURL test").
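
For reference, a redirect list like the one described can be captured with a sketch along these lines. The helper names `extract_locations` and `trace_redirects` are hypothetical and are not the code behind the test page. The sketch assumes redirects are followed with `CURLOPT_FOLLOWLOCATION` and that response headers are included in the output via `CURLOPT_HEADER`.

```php
<?php
// Hypothetical helper: pull every Location header out of raw response headers.
function extract_locations(string $rawHeaders): array {
    preg_match_all('/^Location:\s*(.+?)\s*$/mi', $rawHeaders, $matches);
    return $matches[1];
}

// Hypothetical helper: follow redirects and return each redirect target in order.
function trace_redirects(string $url, string $ckfile): array {
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_COOKIEJAR      => $ckfile, // save cookies to this file
        CURLOPT_COOKIEFILE     => $ckfile, // load cookies from this file
        CURLOPT_RETURNTRANSFER => true,    // return the response as a string
        CURLOPT_FOLLOWLOCATION => true,    // follow each redirect
        CURLOPT_MAXREDIRS      => 10,      // give up if the redirects loop
        CURLOPT_HEADER         => true,    // prepend the headers of every hop
    ]);
    $response = curl_exec($ch);
    // Combined size of the headers from all responses in the chain
    $headerSize = curl_getinfo($ch, CURLINFO_HEADER_SIZE);
    curl_close($ch);

    if ($response === false) {
        return [];
    }
    return extract_locations(substr($response, 0, $headerSize));
}
```

Calling `trace_redirects($url, $ckfile)` with the article URL and the cookie file should return the chain of `Location` targets, which makes it easier to see where the cookie check fails.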

Thanks for any help!

Russell Winkler

1 Answer

You are enabling automatic cookie handling incorrectly. CURLOPT_COOKIEJAR only enables saving (storing) cookies; you also need to enable loading cookies and sending them with the request (via the CURLOPT_COOKIEFILE option). Otherwise automatic cookie handling won't work and you will run into the "Browser does not accept cookies" problem you mentioned.

So you have to set both the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options to the same value ($ckfile) for each cURL request:

...
curl_setopt($ch, CURLOPT_COOKIEJAR, $ckfile);
curl_setopt($ch, CURLOPT_COOKIEFILE, $ckfile);
...
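
As a fuller illustration of the snippet above, here is a minimal sketch of a helper that applies the same cookie file to every request. The names `cookie_options` and `fetch_with_cookies` are hypothetical. Following redirects with `CURLOPT_FOLLOWLOCATION` is an assumption about what the site needs, not something confirmed here.

```php
<?php
// Hypothetical helper: options that make a handle both load cookies from
// and save cookies to the same file.
function cookie_options(string $ckfile): array {
    return [
        CURLOPT_COOKIEJAR      => $ckfile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $ckfile, // cookies are read from here and sent with the request
        CURLOPT_RETURNTRANSFER => true,    // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,    // assumption: follow the cookie-check redirects
        CURLOPT_MAXREDIRS      => 10,      // give up if the redirects loop
    ];
}

// Hypothetical helper: fetch one URL using the shared cookie file.
// Returns the response body, or false on failure.
function fetch_with_cookies(string $url, string $ckfile) {
    $ch = curl_init($url);
    curl_setopt_array($ch, cookie_options($ckfile));
    $output = curl_exec($ch);
    curl_close($ch); // release the handle so the cookie file is up to date
    return $output;
}
```

With this in place, each request in the question's function becomes a call such as `fetch_with_cookies($url, $ckfile)`, so both options are always set together.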
hindmost
  • Thanks, I have tried that, as was suggested in one of the other Stack Overflow questions, but I am still getting the same response. I have updated my question to reflect your suggestion, as well as the test page. – Russell Winkler Jun 28 '14 at 19:45
  • @Russell Winkler Read the answer attentively. You have to set **both** options at **each** CURL request – hindmost Jun 28 '14 at 20:01
  • I apologize, that was my error in updating the post. I did add the `CURLOPT_COOKIEFILE` in both requests and have now updated the post appropriately. – Russell Winkler Jun 28 '14 at 20:08