
From everything I've read, it seems that this is impossible. But here is my scenario:

I need to scrape the contents of a table containing for-sale housing information. The page is not password protected or anything, but you first have to click an "I Agree" link on the previous page so that a cookie gets set saying you agree that the content may not be 100% accurate. Only then are you shown the data. Is there any way at all to accomplish this using PHP/jQuery/JavaScript? I know I cannot use an iframe because it would be cross-domain, and I do not have access to this other website.

Thanks for any answers, as I'm not really expecting anything positive. :) And many thanks if you can tell me how to do this. :D

James
  • You can use cURL to get the data you need... – Brian Driscoll Feb 12 '13 at 21:47
  • 1
    Cookies are sent with the header of an HTTP Request. – Shmiddty Feb 12 '13 at 21:47
  • What you are doing sounds kind of shady, but cURL is definitely a good option, as previous commenters have mentioned. – marteljn Feb 12 '13 at 21:49
  • @marteljn haha, I know it definitely could be, but in this case all I'm doing is pulling a list of foreclosed houses that the county is putting up for auction from the county's own website. – James Feb 12 '13 at 21:56
  • 1
    Possibly duplicate of http://stackoverflow.com/questions/13210140/how-can-i-scrape-website-content-in-php-from-a-website-that-requires-a-cookie-lo - check that out. – LSerni Feb 12 '13 at 22:05

2 Answers

3

Use a server-side script (PHP using cURL) to crawl the website and return the information you need. Make sure you send the appropriate Cookie header with your request, representing the "I Agree" cookie.

Sample:

<?php

$ch = curl_init();

// Target page; the cookie name and value below are placeholders -- use your
// browser's dev tools to find the actual cookie the "I Agree" link sets
curl_setopt($ch, CURLOPT_URL, 'http://www.example.com/');
curl_setopt($ch, CURLOPT_COOKIE, 'I_Agree=1');

// Return the response as a string instead of printing it directly
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

$responseBody = curl_exec($ch);

curl_close($ch);

// Parse the information you need out of $responseBody and echo it
// back as your own response body

?>
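For that last step, one option is PHP's DOMDocument with DOMXPath. A minimal sketch, continuing from $responseBody above and assuming the listings sit in a plain HTML table (the id "listings" is made up; adjust the XPath to the real markup):

<?php

// Parse the fetched HTML; the @ suppresses warnings that real-world
// malformed markup commonly triggers
$doc = new DOMDocument();
@$doc->loadHTML($responseBody);

$xpath = new DOMXPath($doc);

// Hypothetical selector -- replace "listings" with the table's real id
$rows = $xpath->query('//table[@id="listings"]//tr');

$listings = array();
foreach ($rows as $row) {
    $cells = array();
    foreach ($row->getElementsByTagName('td') as $cell) {
        $cells[] = trim($cell->textContent);
    }
    if ($cells) {
        $listings[] = $cells;
    }
}

// Hand the extracted rows back to your own page, e.g. as JSON
header('Content-Type: application/json');
echo json_encode($listings);

?>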

Now you can access the information from your own website by calling the server-side script above. For details on how to use cURL, take a look at the documentation.

CodeZombie
  • How could I set that header? Within the cURL command, or just a regular PHP header? And if just the PHP header, would that work cross-domain? – James Feb 12 '13 at 21:55
  • @James: Take a look at my example. Since you request the external site from your server and return it to your own website, from the browser's perspective you are only ever accessing pages from a single domain. – CodeZombie Feb 12 '13 at 22:02
1

cURL can store and recall cookies from a file, depending on the options you set. Here is the "cookiejar" example:

http://curl.haxx.se/libcurl/php/examples/cookiejar.html

Check out the CURLOPT_COOKIEJAR and CURLOPT_COOKIEFILE options.
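A minimal sketch of that two-request flow, with both URLs as placeholders: the first request loads the "I Agree" page so the site can set its cookie, and the second request replays that cookie when fetching the data.

<?php

// Temporary file where cURL will persist cookies between requests
$cookieFile = tempnam(sys_get_temp_dir(), 'cookies');

// Request 1: hit the agreement page (placeholder URL) and let cURL
// write any Set-Cookie headers into the jar file on curl_close()
$ch = curl_init('http://www.example.com/agree');
curl_setopt($ch, CURLOPT_COOKIEJAR, $cookieFile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_exec($ch);
curl_close($ch);

// Request 2: fetch the data page (placeholder URL), sending back the
// cookies saved above
$ch = curl_init('http://www.example.com/listings');
curl_setopt($ch, CURLOPT_COOKIEFILE, $cookieFile);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$html = curl_exec($ch);
curl_close($ch);

unlink($cookieFile);

?>

If the "I Agree" link actually fires a POST or an ajax call rather than a plain page load, point the first request at that URL instead, as described in the comment below.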

James L.
  • In this case you need to dig into the 'Network' tab in the console. Mimic the exact behavior of the ajax request (send the same headers). Form your first cURL request to hit the ajax URI and store the returned headers in the COOKIEJAR, then make the second request pulling from COOKIEFILE. Same process, but you need to hit the ajax address. Definitely doable - I've used it to scrape ajax comments before. – James L. Feb 12 '13 at 21:56
  • Thanks for the help, but this ended up being more than I needed (though I couldn't get it to work in the first place anyway, for some reason [probably due to server-side scripting on the other website]). I +1'd it anyway, because I checked the examples and know that it could have worked. – James Feb 12 '13 at 22:26