I use a site that stores two cookies (ASP.NET_SessionId and __RequestVerificationToken_XXXXXXXXX) when you visit it.

The page consists of a div with a link to a pdf and an iframe with a "pdf viewer" source.

I am trying to use cURL to retrieve those two cookies then download the pdf. I have found that I have to set several options in cURL. However, I am still not able to download the pdf.

My setup now is:

  1. Hit the main page and (a) save the ASP.NET_SessionId cookie, (b) find the "pdf viewer" URL from the iframe, and (c) find the pdf download URL
  2. Hit the "pdf viewer" URL and save the __RequestVerificationToken_XXXXXXXXX cookie
  3. Create the cookie header from steps 1 and 2
  4. Download the file using cURL, the pdf download URL, and sending cookie headers

However, my file result is just a login page.

First cURL:

$agent= 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0';
$report_url = "[my_main_url_here]";

$ch1 = curl_init($report_url);
curl_setopt($ch1, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch1, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch1, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch1, CURLOPT_HEADER, true);
curl_setopt($ch1, CURLOPT_SSLVERSION, 4);
curl_setopt($ch1, CURLOPT_USERAGENT, $agent);
curl_setopt($ch1, CURLOPT_SSL_CIPHER_LIST, 'AES128-SHA:RC2-CBC-MD5');
curl_setopt($ch1, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch1, CURLOPT_HEADER, 1);
curl_setopt($ch1, CURLOPT_VERBOSE, true);
curl_setopt($ch1, CURLOPT_NOBODY, false);
$output1 = curl_exec($ch1);
curl_close($ch1);

I use preg_match to find the pdf download link:

preg_match("/\/ReportID=.{30}/", $output1, $pdf_link);
$pdf_viewer_full = "https://gate.aon.com" . $pdf_link[0];

Then I hit the pdf viewer URL to get the second cookie:

$ch2 = curl_init($viewer_url_full);
curl_setopt($ch2, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch2, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch2, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch2, CURLOPT_HEADER, true);
curl_setopt($ch2, CURLOPT_SSLVERSION, 4);
curl_setopt($ch2, CURLOPT_USERAGENT, $agent);
curl_setopt($ch2, CURLOPT_SSL_CIPHER_LIST, 'AES128-SHA:RC2-CBC-MD5');
curl_setopt($ch2, CURLOPT_HEADER, 1);
curl_setopt($ch2, CURLOPT_VERBOSE, true);
curl_setopt($ch2, CURLOPT_COOKIEJAR, "cookie.txt");
curl_setopt($ch2, CURLOPT_NOBODY, false);
$output2 = curl_exec($ch2);
curl_close($ch2);

I then pull out the cookies from the headers of both of those:

preg_match("/ASP.NET_SessionId=......................../", $output1, $cookie1);
preg_match("/__RequestVerificationToken_.{145}/", $output2, $cookie2);
$cookies = 'Cookie: ' . $cookie1[0] . '; ' . $cookie2[0];

And then attempt to download the file:

$headers = array ($cookies);
$file = fopen ('Report.pdf', 'w+');
$ch3 = curl_init($pdf_link_full);
curl_setopt($ch3, CURLOPT_SSL_CIPHER_LIST, 'AES128-SHA:RC2-CBC-MD5');
curl_setopt($ch3, CURLOPT_HTTPHEADER, $headers);
curl_setopt($ch3, CURLOPT_FILE, $file);
curl_setopt($ch3, CURLOPT_SSL_VERIFYPEER, false);
curl_setopt($ch3, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch3, CURLOPT_SSLVERSION, 4);
curl_setopt($ch3, CURLOPT_USERAGENT, $agent);
curl_setopt($ch3, CURLOPT_COOKIEFILE, "cookie.txt");
$output3 = curl_exec($ch3);
curl_close($ch3);

EDIT: If I manually set $pdf_link_full, it works. However, if I find it with preg_match (like above), it fails.

However, if I print $pdf_link_full and $pdf_link_full_2, they appear as the same exact thing. Am I missing encoding or something else here? Thanks!

MrPeanut
  • Show your code so we can tell you how to fix the cookie options. – Barmar Jul 01 '15 at 19:55
  • If you use `CURLOPT_COOKIEFILE` and `CURLOPT_COOKIEJAR`, cURL should take care of receiving and sending the cookies automatically. – Barmar Jul 01 '15 at 19:57
  • See http://stackoverflow.com/questions/23745468/curl-php-setting-cookies-properly/23747787#23747787 – Barmar Jul 01 '15 at 19:58
  • Sorry about that. I added code. I commented out the `$headers` and the `CURLOPT_HTTPHEADER` and I am still getting the same issue. – MrPeanut Jul 01 '15 at 20:01
  • So it turns out the problem is apparently due to my `preg_match`. See my edit. – MrPeanut Jul 01 '15 at 20:46
  • I don't see where you're setting `$viewer_url_full` or `$pdf_link_full` using `preg_match`. You're just setting `$cookies`. – Barmar Jul 01 '15 at 20:49
  • Sorry, I updated again. When I find `$pdf_link_full` using `preg_match`, it fails. When I declare `$pdf_link_full`, it works. When I print both, they appear as the same exact thing. – MrPeanut Jul 01 '15 at 20:49
  • Maybe there's an extra space at the end when you use the regexp? When you print them, you wouldn't notice this. – Barmar Jul 01 '15 at 20:56
  • Use `var_dump()` to see them with more detail. – Barmar Jul 01 '15 at 20:58
  • Ahhh, I think the problem was that when I manually set it I was using plain ampersands (`&`), while `preg_match` was finding it with `&amp;`. I didn't think it would matter, but `str_replace` seems to have fixed it. Thanks! – MrPeanut Jul 01 '15 at 21:04
  • Use `html_entity_decode()` instead of `str_replace`. – Barmar Jul 01 '15 at 21:04

1 Answer

The issue was with my preg_match. It was returning a URL containing &amp; (the HTML entity), whereas the URL I set manually used a plain ampersand (&).
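
A minimal sketch of why the two strings looked the same when printed: a browser renders `&amp;` as `&`, so echoing both into a page hides the difference, while a direct comparison or `strlen()` reveals it. The URLs here are hypothetical placeholders, not the real report links.

```php
<?php
// Hypothetical URLs for illustration only; the real report URL is not shown in the question.
$manual  = 'https://example.com/Report?ReportID=123&Format=pdf';     // typed by hand
$scraped = 'https://example.com/Report?ReportID=123&amp;Format=pdf'; // as found in the HTML source

// Rendered in a browser, both strings display the same text.
// Comparing them directly shows that they differ.
var_dump($manual === $scraped);

// The scraped string is longer by the characters "amp;".
var_dump(strlen($scraped) - strlen($manual));
```

This is the kind of difference the `var_dump()` suggestion in the comments is meant to expose.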

Replacing &amp; with & resolved the issue.
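
As suggested in the comments, `html_entity_decode()` is probably a more general fix than `str_replace`, since it decodes any entity that appears in the scraped attribute, not only `&amp;`. The sketch below assumes a hypothetical HTML fragment, host name, and regex; none of them come from the actual site.

```php
<?php
// Hypothetical HTML fragment standing in for the page returned by the first request.
$html = '<a href="/Viewer/Download?ReportID=abc123&amp;Format=pdf">Download</a>';

// Extract the raw href value; at this point it still contains HTML entities.
if (preg_match('/href="([^"]+)"/', $html, $m)) {
    // Decode entities such as &amp; before handing the URL to cURL.
    $path = html_entity_decode($m[1], ENT_QUOTES | ENT_HTML5, 'UTF-8');

    // example.com is a placeholder host, assumed for this sketch.
    $pdf_link_full = 'https://example.com' . $path;
    echo $pdf_link_full, PHP_EOL;
}
```

The decoded value can then be passed to `curl_init()` in place of the manually typed URL.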

MrPeanut