Crawling a page only returns 3% of the content

Question

I am trying to crawl a full section of a website, but the data that I need is not in the page from the start. Is there any way to get the data from the website with PHP?

This is the link: https://www.iamsterdam.com/nl/uit-in-amsterdam/uit/agenda and this is the section I need:

After my post was marked as a duplicate, I tried this: https://stackoverflow.com/a/28506533/7007968, but it also doesn't work, so I need another solution. This is what I tried:

get-website.php

$phantom_script= 'get-website.js'; 


$response =  exec ('phantomjs ' . $phantom_script);

echo  $response;

get-website.js

var webPage = require('webpage');
var page = webPage.create();

page.open('https://www.iamsterdam.com/nl/uit-in-amsterdam/uit', function(status) {
  console.log(page.content);
  phantom.exit();
});

This is all I get back (around 3% of the page):

</div><div id="ads"></div><script src="https://analytics.twitter.com/i/adsct?p_id=Twitter&amp;p_user_id=0&amp;txn_id=nvk6a&amp;events=%5B%5B%22pageview%22%2Cnull%5D%5D&amp;tw_sale_amount=0&amp;tw_order_quantity=0&amp;tw_iframe_status=0&amp;tpx_cb=twttr.conversion.loadPixels" type="text/javascript"></script></body></html>

I have the feeling that I am getting closer. This is what I have after a lot of searching:

var webPage = require('webpage');
var page = webPage.create();
var settings = {
  operation: "POST",
  encoding: "utf8",
  headers: {
    "Content-Type": "application/json"
  },
  data: JSON.stringify({
    DateFilter: 04112016,
    LastMinuteTickets:  0,
    PageId: "3418a37d-b907-4c80-9d67-9fec68d96568",
    Skip: 0,
    Take:   12,
    ViewMode: 1
  })
};

page.open('https://www.iamsterdam.com/api/AgendaApi/', settings, function(status) {
  console.log(page.content);
  phantom.exit();
});

But what I get back doesn't look good:

Message":"An error has occurred.","ExceptionMessage":"Page could not be found","ExceptionType":"System.ApplicationException","StackTrace":" at Axendo.SC.AM.Iamsterdam.Controllers.Api.AgendaApiController.GetResultsInternal(RequestModel requestModel)\r\n at lambda_method(Closure , Object , Object[] )\r\n

etc.

I hope someone can help me.

See my answer about the missing 97% of the page. As to why the API won't work, it could be that the page you're requesting really is missing. If you want to discuss it further, you should ask about the API issues in another question. — Vaviloff, Nov 03 '16 at 02:13

score 1 · Accepted Answer · answered Nov 03 '16 at 02:09

To address your main question about the 3%: you are using exec incorrectly. When it is called like this

$response =  exec ('phantomjs ' . $phantom_script);

$response will contain only the last line of what the command printed to the terminal during execution. Because you did console.log(page.content); the last line of the HTML document was placed into the $response variable.

The correct use of exec would be

exec ('phantomjs ' . $phantom_script, $response);

This way the result is placed into the $response variable as an array, with each line of output as one element. Then, if you just want the HTML as a single string, you can do

$html = implode("\n", $response);
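The difference between the two call forms can be checked without PhantomJS. The sketch below is only an illustration: it assumes the PHP CLI on a Unix-like shell and uses a stand-in command (the PHP interpreter itself printing three lines) in place of 'phantomjs get-website.js'. The variable names are made up for the example.

```php
<?php
// Stand-in for 'phantomjs get-website.js': a command that prints three lines.
// Assumes the PHP CLI on a Unix-like shell; PHP_BINARY is the running interpreter.
$script  = 'echo "first\nsecond\nthird\n";';
$command = escapeshellarg(PHP_BINARY) . ' -r ' . escapeshellarg($script);

// Form 1: only the return value is used, so only the last line survives.
$lastLine = exec($command);

// Form 2: the second argument collects every line of output,
// the third receives the exit status of the command.
$lines = [];
exec($command, $lines, $exitCode);

var_dump($lastLine);             // the last printed line only
var_dump(implode("\n", $lines)); // all lines joined back together
var_dump($exitCode);             // 0 when the command succeeded
```

Note that exec appends to the array passed as the second argument, so it is worth starting from an empty array if the variable may already hold something.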

But a simpler way is to use the function that is meant for this task:

passthru ('phantomjs ' . $phantom_script);

passthru executes a command and passes the received data unmodified, straight to the output.

So if you want to capture it in a variable, do:

ob_start();
passthru ('phantomjs ' . $phantom_script);
$html = ob_get_clean();
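If this pattern is needed in more than one place, it could be wrapped in a small function. The following is a minimal sketch, not a definitive implementation: captureOutput is a hypothetical helper name, and the exit-code check is an addition that the snippet above does not have. It relies on passthru's optional second argument, which receives the exit status of the command.

```php
<?php
/**
 * Hypothetical helper (not part of PHP or PhantomJS): runs a shell command
 * and returns everything it printed to stdout as one string.
 */
function captureOutput(string $command): string
{
    ob_start();                          // start buffering PHP's output
    passthru($command, $exitCode);       // raw command output lands in the buffer
    $output = (string) ob_get_clean();   // fetch the buffer and stop buffering

    if ($exitCode !== 0) {
        throw new RuntimeException("Command failed with exit code $exitCode");
    }
    return $output;
}

// Intended use with the script from the question (requires phantomjs on the PATH):
// $html = captureOutput('phantomjs ' . escapeshellarg('get-website.js'));
```

Passing the script name through escapeshellarg is a precaution that matters once the argument comes from a variable rather than a fixed string.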

I will post another question about the API, because the API will be a lot faster, but for now your answer helped me a lot, thanks. — N. Smeding, Nov 03 '16 at 08:28
Here is the link, if you have some time for me: http://stackoverflow.com/q/40397214/7007968 — N. Smeding, Nov 03 '16 at 08:55
