I am trying to scrape data that the target site writes into a new window (opened with JavaScript's window.open()) after I post to it via cURL, but I am unsure how to go about this.

The target site only generates the data I need when certain parameters are posted to it; there is no other way to obtain it.

The following code simply dumps the result of the cURL request, but the result does not contain any of the relevant data.

My code:

//build post data for request
$proofData = array("formula" => $formula,
                     "proof" => $proof,
                    "action" => $action);
$postProofData = http_build_query($proofData);

$ch = curl_init($url); //open connection

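//sort curl settings for request
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_POST, true); //CURLOPT_POST is a boolean flag, not a field count
curl_setopt($ch, CURLOPT_POSTFIELDS, $postProofData);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); //return the response as a string instead of printing it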

//obtain data from LigLab
$result = curl_exec($ch);

//finish connection 
curl_close($ch);


echo "formula: " . $formula;
var_dump($result);

The following is the code on the target site that generates the new window:

var proof =  "<?php echo str_replace("\n","|",$annoted_proof) ?>";
var lines = proof.split('|');
proof_window=window.open("","Proof and Justifications","scrollbar=yes,resizable=yes, titlebar=yes,menubar=yes,status=yes,width= 800, height=800, alwaysRaised=yes");

for(var i = 0;i < lines.length;i++){
    proof_window.document.write(lines[i]);
    proof_window.document.write("\n");
}

I want to scrape the lines variable, but it is only populated after the page has loaded and the user has interacted with it.

neonite

1 Answer

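cURL does not execute JavaScript, so it cannot give you anything the page produces at run time in the browser; it only returns the raw response the server sends.

You have to use a headless browser, which emulates a real browser, including events (clicks, hover) and JavaScript execution.

You can start here: http://www.simpletest.org/en/browser_documentation.html or with the question "PHP Headless Browser?".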
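
Before reaching for a headless browser, it may be worth checking the raw response itself. The target site's snippet fills the proof variable from PHP on the server, so the string could already be present in the HTML returned by the POST. That is an assumption about this particular site, not something confirmed here. A minimal sketch, where extractProofLines and the sample markup are hypothetical names and data made up for illustration:

```php
<?php
// Hypothetical helper: pulls the proof lines out of raw HTML returned by cURL.
// Assumes the response contains an inline script of the form:
//   var proof = "line1|line2|...";
function extractProofLines(string $html): array
{
    // Capture the double-quoted string literal, allowing backslash escapes inside it.
    if (!preg_match('/var\s+proof\s*=\s*"((?:[^"\\\\]|\\\\.)*)"/s', $html, $m)) {
        return []; // the script is not present in this response
    }
    // Undo C-style backslash escapes (an approximation of JavaScript unescaping),
    // then split on the "|" separator the site's own script uses.
    return explode('|', stripcslashes($m[1]));
}

// Made-up sample standing in for the string returned by curl_exec().
$sample = '<script>var proof =  "1. p -> q|2. p|3. q";' . "\n"
        . 'var lines = proof.split(\'|\');</script>';

print_r(extractProofLines($sample));
```

In the question's code this would be called as extractProofLines($result), which only works if CURLOPT_RETURNTRANSFER is set so that $result holds the response body. If the function returns an empty array, the data really is produced only in the browser and a headless browser is the way to go.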

Flo