I use a site that stores two cookies (ASP.NET_SessionID and __RequestVerificationToken_XXXXXXXXX) when you visit it.
The page consists of a div with a link to a pdf and an iframe with a "pdf viewer" source.
I am trying to use cURL to retrieve those two cookies and then download the pdf. I have found that I need to set several cURL options, but I am still unable to download the pdf.
My setup now is:
- Hit the main page and (a) save the ASP.NET_SessionID cookie, (b) find the "pdf viewer" URL from the iframe, and (c) find the pdf download URL
- Hit the "pdf viewer" URL and save the __RequestVerificationToken_XXXXXXXXX cookie
- Create the cookie header from steps 1 and 2
- Download the file using cURL, the pdf download URL, and sending cookie headers
However, the file I end up with is just a login page.
First cURL:
$agent= 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0';
$report_url = "[my_main_url_here]";
$ch1 = curl_init($report_url);
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch1, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch1, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch1, CURLOPT_HEADER, true); // include the response headers in $output1
curl_setopt($ch1, CURLOPT_SSLVERSION, 4); // 4 = CURL_SSLVERSION_TLSv1_0
curl_setopt($ch1, CURLOPT_USERAGENT, $agent);
curl_setopt($ch1, CURLOPT_SSL_CIPHER_LIST, 'AES128-SHA:RC2-CBC-MD5');
curl_setopt($ch1, CURLOPT_COOKIEJAR, "cookie.txt"); // written when the handle is closed
curl_setopt($ch1, CURLOPT_VERBOSE, true);
curl_setopt($ch1, CURLOPT_NOBODY, false);
$output1 = curl_exec($ch1);
curl_close($ch1);
I use preg_match to find the pdf download link:
preg_match("/\/ReportID=.{30}/", $output1, $pdf_link);
$pdf_link_full = "https://gate.aon.com" . $pdf_link[0];
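A fixed-width pattern such as `.{30}` stops matching correctly if the ID length changes, and a link taken from HTML source can contain entities such as `&amp;` that a browser would decode before requesting it. The following is only a minimal sketch, assuming the link sits in an `href` attribute; `extractReportLink()` and the sample markup are hypothetical and not taken from the real site.

```php
<?php
// Hypothetical helper: return the first href that contains "ReportID=",
// with HTML entities decoded, or null when no such link is present.
function extractReportLink(string $html): ?string
{
    // Capture everything between the quotes instead of a fixed character count.
    if (!preg_match('/href=["\']([^"\']*ReportID=[^"\']*)["\']/i', $html, $m)) {
        return null;
    }
    // Turn "&amp;" back into "&" so the URL matches what a browser would request.
    return html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

// Made-up markup, for illustration only.
$sample = '<div><a href="/Report/Download?ReportID=abc123&amp;format=pdf">PDF</a></div>';
$link = extractReportLink($sample);
var_dump($link);
```

Whether the real page encodes its links this way is an assumption; dumping the raw match with `var_dump()` before and after decoding would show it.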
Then I hit the pdf viewer URL to get the second cookie:
// $viewer_url_full holds the iframe's "pdf viewer" URL from step 1 (b); its extraction is not shown here
$ch2 = curl_init($viewer_url_full);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch2, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch2, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch2, CURLOPT_HEADER, true); // include the response headers in $output2
curl_setopt($ch2, CURLOPT_SSLVERSION, 4); // 4 = CURL_SSLVERSION_TLSv1_0
curl_setopt($ch2, CURLOPT_USERAGENT, $agent);
curl_setopt($ch2, CURLOPT_SSL_CIPHER_LIST, 'AES128-SHA:RC2-CBC-MD5');
curl_setopt($ch2, CURLOPT_VERBOSE, true);
curl_setopt($ch2, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch2, CURLOPT_NOBODY, false);
$output2 = curl_exec($ch2);
curl_close($ch2);
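The first two requests only set CURLOPT_COOKIEJAR, which writes cookies when a handle is closed; CURLOPT_COOKIEFILE is what makes cURL read stored cookies and send them. A minimal sketch of sharing one option set across all requests follows, assuming the same jar file is used for both reading and writing; `sharedCurlOptions()` is a hypothetical helper, and the SSL options from the snippets above would still need to be added to it.

```php
<?php
// Hypothetical helper: options that every request in the sequence could share.
function sharedCurlOptions(string $agent, string $cookieJar): array
{
    return [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_USERAGENT      => $agent,
        CURLOPT_COOKIEJAR      => $cookieJar, // cookies are written here on curl_close()
        CURLOPT_COOKIEFILE     => $cookieJar, // cookies are read from here and sent
    ];
}

$options = sharedCurlOptions('Mozilla/5.0', __DIR__ . '/cookie.txt');

// Usage (not executed here): apply to each handle before curl_exec().
// $ch = curl_init($url);
// curl_setopt_array($ch, $options);
```

With this arrangement the session cookie from the first response would be sent along with the second request, which the snippets above do not do. Whether the site requires that is an assumption.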
I then pull the cookies out of the headers of both responses:
preg_match("/ASP.NET_SessionId=......................../", $output1, $cookie1);
preg_match("/__RequestVerificationToken_.{145}/", $output2, $cookie2);
$cookies = 'Cookie: ' . $cookie1[0] . '; ' . $cookie2[0];
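Matching a fixed number of characters depends on the cookie values always having the same length. As an alternative, here is a minimal sketch that reads every `Set-Cookie` line up to the first `;`, assuming the response text includes headers (as it does with CURLOPT_HEADER enabled). `parseSetCookies()`, `buildCookieHeader()`, and the sample response are hypothetical.

```php
<?php
// Hypothetical helper: collect name => value pairs from all Set-Cookie lines.
function parseSetCookies(string $response): array
{
    $cookies = [];
    // Name runs up to "="; value runs up to ";" or the end of the line.
    preg_match_all('/^Set-Cookie:\s*([^=;\s]+)=([^;\r\n]*)/mi', $response, $matches, PREG_SET_ORDER);
    foreach ($matches as $m) {
        $cookies[$m[1]] = $m[2];
    }
    return $cookies;
}

// Hypothetical helper: join the pairs into a single Cookie request header.
function buildCookieHeader(array $cookies): string
{
    $pairs = [];
    foreach ($cookies as $name => $value) {
        $pairs[] = $name . '=' . $value;
    }
    return 'Cookie: ' . implode('; ', $pairs);
}

// Made-up response text, for illustration only.
$sample = "HTTP/1.1 200 OK\r\n"
    . "Set-Cookie: ASP.NET_SessionId=abc123; path=/; HttpOnly\r\n"
    . "Set-Cookie: __RequestVerificationToken_X=tok456; path=/\r\n"
    . "\r\n<html></html>";
$header = buildCookieHeader(parseSetCookies($sample));
echo $header, "\n";
```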
And then attempt to download the file:
$headers = array ($cookies);
$file = fopen ('Report.pdf', 'w+');
$ch3 = curl_init($pdf_link_full);
curl_setopt($ch3, CURLOPT_SSL_CIPHER_LIST, 'AES128-SHA:RC2-CBC-MD5');
curl_setopt($ch3, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch3, CURLOPT_FILE, $file);
curl_setopt($ch3, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch3, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch3, CURLOPT_SSLVERSION, 4);
curl_setopt($ch3, CURLOPT_USERAGENT, $agent);
curl_setopt($ch3, CURLOPT_COOKIEFILE, "cookie.txt");
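$output3 = curl_exec($ch3); // with CURLOPT_FILE set this is true/false, not the body
curl_close($ch3);
fclose($file); // flush and close Report.pdf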
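Since a failed attempt still writes a file (the login page), it can help to check the result programmatically. A minimal sketch, relying on the fact that a PDF file begins with the bytes `%PDF-`; `looksLikePdf()` is a hypothetical helper, and the temporary files only stand in for Report.pdf.

```php
<?php
// Hypothetical helper: true when the file starts with the PDF signature "%PDF-".
function looksLikePdf(string $path): bool
{
    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }
    $magic = fread($handle, 5);
    fclose($handle);
    return $magic === '%PDF-';
}

// Demonstration with temporary files: one fake PDF, one fake login page.
$pdfPath = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($pdfPath, "%PDF-1.4\n...");
$htmlPath = tempnam(sys_get_temp_dir(), 'htm');
file_put_contents($htmlPath, '<html><body>Login</body></html>');

$pdfOk = looksLikePdf($pdfPath);
$htmlOk = looksLikePdf($htmlPath);
var_dump($pdfOk, $htmlOk);

unlink($pdfPath);
unlink($htmlPath);
```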
EDIT: If I manually set $pdf_link_full, it works. However, if I find it with preg_match (like above), it fails.
Yet when I print $pdf_link_full and $pdf_link_full_2, they appear to be exactly the same. Am I missing an encoding issue or something else here? Thanks!
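Two strings can look identical when printed in a browser and still differ, because the browser renders an entity such as `&amp;` as a plain `&`. A minimal sketch for comparing them byte by byte follows; `describeDifference()` and both sample URLs are hypothetical, and an HTML entity is only one possible cause.

```php
<?php
// Hypothetical helper: describe how two strings differ at the byte level.
function describeDifference(string $a, string $b): string
{
    if ($a === $b) {
        return 'identical';
    }
    $max = min(strlen($a), strlen($b));
    for ($i = 0; $i < $max; $i++) {
        if ($a[$i] !== $b[$i]) {
            return sprintf(
                'first difference at byte %d: 0x%02x vs 0x%02x',
                $i,
                ord($a[$i]),
                ord($b[$i])
            );
        }
    }
    return sprintf('same prefix, lengths differ: %d vs %d', strlen($a), strlen($b));
}

// Made-up values: one typed by hand, one as it might appear in HTML source.
$manual  = '/Download?ReportID=abc&type=pdf';
$scraped = '/Download?ReportID=abc&amp;type=pdf';
echo describeDifference($manual, $scraped), "\n";
echo describeDifference($manual, html_entity_decode($scraped)), "\n";
```

Running `var_dump()` on both variables from the command line, or viewing the page source instead of the rendered page, shows the same information, including the string lengths.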