
I'm trying to build a web crawler. Below is the starting code:

<?php
$start = "https://www.yocale.com/Search?latitude=29.748093&longitude=-95.37127699999996";

function follow($url)
{
  $content = file_get_contents($url);
  $content = str_replace('src="/', 'src="https://www.yocale.com/', $content);
  $content = str_replace('href="/', 'href="https://www.yocale.com/', $content);
  $content = str_replace('src="https://www.yocale.com//maps.googleapis.com', 'src="//maps.googleapis.com', $content);
  $content = str_replace("url: '/", "url: 'https://www.yocale.com/", $content);
  $content = str_replace("= '/", "= 'https://www.yocale.com/", $content);

  echo $content;
}

follow($start);

This code successfully renders the HTML in the browser and loads the files the page requires, such as its JavaScript.

Part of that JavaScript is an AJAX call that makes this request:

https://www.yocale.com/Search?distance=25km&latitude=29.748093&longitude=-95.37127699999996&_=1525228859581

It doesn't fetch any data. I know it has to do with CORS, because the console log shows:

Failed to load https://www.yocale.com/Search?distance=25km&latitude=29.748093&longitude=-95.37127699999996&_=1525228859581: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://search.oo' is therefore not allowed access.

The same error appears for other requests, such as fonts.

Is there a way, using PHP, to crawl a page like this that renders some of its data in the browser via AJAX or similar?

Fil
  • https://stackoverflow.com/questions/3076414/ways-to-circumvent-the-same-origin-policy – Kisaragi May 02 '18 at 03:13
  • I'd use a web driver, like [this](https://github.com/facebook/php-webdriver) to let the page render properly before crawling. But I'm sure there are other ways as well. – csb May 02 '18 at 03:16
  • @csb Is there a PHP-only library? That could be a problem if the hosting doesn't have Java – Fil May 02 '18 at 08:34
  • @Fil Unfortunately, not that I know of. – csb May 02 '18 at 08:39
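For a route that needs neither Java nor a browser: CORS is enforced by browsers, so the same Search request can be issued from PHP on the server. The following is only a minimal sketch under stated assumptions. The helper names `build_search_url` and `fetch_search` are hypothetical, the `X-Requested-With` header is a guess at what the endpoint may expect, and the format of the response (JSON or HTML) is not known from the question.

```php
<?php
// Hypothetical helper: rebuilds the Search URL seen in the browser's network log.
// Coordinates are passed as strings so PHP does not reformat the floats.
function build_search_url(string $latitude, string $longitude, string $distance = '25km'): string
{
    $query = http_build_query([
        'distance'  => $distance,
        'latitude'  => $latitude,
        'longitude' => $longitude,
    ], '', '&');

    return 'https://www.yocale.com/Search?' . $query;
}

// Hypothetical helper: fetches the URL server-side, where the browser's
// same-origin policy does not apply. Returns the body, or false on failure.
function fetch_search(string $url)
{
    $context = stream_context_create([
        'http' => [
            'method'  => 'GET',
            // Assumption: the endpoint may look for an XHR-style request header.
            'header'  => "X-Requested-With: XMLHttpRequest\r\n",
            'timeout' => 10,
        ],
    ]);

    return file_get_contents($url, false, $context);
}
```

Calling `fetch_search(build_search_url('29.748093', '-95.37127699999996'))` would then return whatever the server sends for that request, which the crawler can parse itself instead of relying on the page's JavaScript.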

1 Answer


Insert a new line between lines 1 and 2 of your script containing the following:

header('Access-Control-Allow-Origin: https://www.somewebsite.com', false);

So something like this:

<?php
header('Access-Control-Allow-Origin: https://www.somewebsite.com', false);
$start = "https://www........

It's very important that the header() call appears at the very beginning of the PHP file, before any HTML or other output is sent, because PHP cannot set headers once output has started.
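The ordering requirement can be made explicit with a small guard. This is only a sketch: `cors_header_line` and `send_cors_header` are hypothetical helper names, and `https://www.somewebsite.com` is the placeholder origin used above. Keep in mind that this header controls which origins may read responses from your own script.

```php
<?php
// Hypothetical helper: formats the header line for a given origin.
function cors_header_line(string $origin): string
{
    return 'Access-Control-Allow-Origin: ' . $origin;
}

// Hypothetical helper: sends the header only while that is still possible.
function send_cors_header(string $origin): bool
{
    if (headers_sent()) {
        // Output has already started, so header() would only raise a warning.
        return false;
    }

    // The second argument (false) adds the header instead of replacing
    // an earlier header of the same name.
    header(cors_header_line($origin), false);
    return true;
}
```

Calling `send_cors_header('https://www.somewebsite.com');` as the first statement of the script, before any echo or HTML, should return true; a false return means something was printed earlier.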

c.diaz