
I have created a very simple web crawler in PHP that crawls some soccer sites for match results.

But crawling a single page takes about 0.5 - 1 second, so if I have a lot of URLs to crawl it will take a lot of time.

This is the start of my code for crawling the site:

$doc = new DOMDocument();
// loadHTMLFile() downloads the page and parses it in one blocking call
$doc->loadHTMLFile("http://resultater.dai-sport.dk/tms/Turneringer-og-resultater/Pulje-Stilling.aspx?PuljeId=229");
$xpath = new DOMXpath($doc);

I have created the crawler myself, so maybe there is a better or quicker way to do this? Or maybe my expectations about the speed are too high?

  • Network connection timing has [various types](https://developer.chrome.com/devtools/docs/network#resource-network-timing). Is 0.5 ~ 1 second the overall time? – Raptor Apr 20 '15 at 08:51
  • You can use threads to crawl several pages at the same time (see the parallel-fetch sketch after these comments). – Iván Pérez Apr 20 '15 at 08:51
  • Where does the crawler run? From your local PC? What kind of connection do you have? – rvandoni Apr 20 '15 at 08:52
  • There are several strategies to reduce loading time, including skipping image loading, static DNS, caching, etc. – Raptor Apr 20 '15 at 08:52
  • Looks good enough to me. Don't process them sequentially... – Karoly Horvath Apr 20 '15 at 08:53
  • Half a second for fetching and parsing a DOM document in interpreted code is still pretty fast. PHP is functional, but not fast; it would be better to write a crawler in a compiled language, e.g. C. @IvánPérez suggested threads; that's the best approach if you want to stick with PHP. – yergo Apr 20 '15 at 08:53
  • @IvánPérez - I will look at multithreading - is there a limit to how many threads can run at the same time? – Andreas Baran Apr 20 '15 at 09:01
  • Since you want to crawl many URLs on one site, and not one URL per many sites, you need to slow down, not speed up. If you regularly scrape content at a fast rate you can expect to be IP-blocked. Put a few seconds' pause between each HTTP operation, and call the script from a cron job. – halfer Apr 21 '15 at 08:47
  • @halfer - why would it help to pause between each HTTP operation? My IP will still call the site many times. Is there something about timing and HTTP operations? – Andreas Baran Apr 21 '15 at 14:49
  • If you crawl too fast on a good connection, you'll be (accidentally) performing a denial of service attack. You may be subject to automated or manual IP/range blocking. – halfer Apr 21 '15 at 15:21
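
As the commenters suggest, the main win is fetching pages in parallel instead of sequentially. Below is a minimal sketch using PHP's built-in curl_multi API, which runs several downloads at once in a single process; the URL list is illustrative and error handling is omitted:

<?php
// Fetch several pages in parallel with curl_multi (needs only the curl extension).
$urls = [
    "http://resultater.dai-sport.dk/tms/Turneringer-og-resultater/Pulje-Stilling.aspx?PuljeId=229",
    // ... more URLs to crawl ...
];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // don't hang on a slow server
    curl_multi_add_handle($mh, $ch);
    $handles[$url] = $ch;
}

// Drive all transfers at once; this loop blocks until every handle is done.
do {
    curl_multi_exec($mh, $running);
    curl_multi_select($mh); // wait for network activity instead of busy-looping
} while ($running > 0);

foreach ($handles as $url => $ch) {
    $html = curl_multi_getcontent($ch);
    $doc = new DOMDocument();
    @$doc->loadHTML($html);      // @ silences warnings caused by sloppy HTML
    $xpath = new DOMXpath($doc);
    // ... extract the match results as before ...
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);

Ten downloads that each take 0.5 - 1 second then finish in roughly the time of the slowest one, not the sum of all of them.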

2 Answers


Please check this library for a kind of asynchronous implementation of your crawler. It uses `yield`, which appeared in PHP 5.5: https://github.com/icicleio/Icicle

You will find usage examples among the library's examples.
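
For intuition, here is a tiny plain-PHP sketch of the idea (deliberately not Icicle's actual API; crawlTask and the URLs are made up). `yield` marks the points where a function can be paused and resumed, so one process can interleave many tasks:

<?php
// Each crawl is a generator; yield marks the points where it can be paused.
function crawlTask($url) {
    yield "fetching $url"; // a real async framework would start non-blocking I/O here
    yield "parsing $url";  // ...and resume the task once the response has arrived
}

// A toy round-robin scheduler: advance every task one step per turn.
$tasks = [crawlTask("http://example.com/a"), crawlTask("http://example.com/b")];
while ($tasks) {
    foreach ($tasks as $i => $task) {
        if (!$task->valid()) { unset($tasks[$i]); continue; } // task finished
        echo $task->current(), "\n";
        $task->next();
    }
}

In Icicle itself, the scheduler resumes each task when its non-blocking network I/O completes, rather than in this toy round-robin order.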

  • Basically this is just a way of wrapping the pcntl_fork functions in a library. When feeling lucky, use pcntl_fork and code your own (shell) scripts (see the sketch after these comments). – twicejr Apr 20 '15 at 11:38
  • But to use the pcntl extension's functionality, you must have it installed on the server. Icicle can work without it. – Anton Apr 21 '15 at 06:16
  • [Icicle](https://github.com/icicleio/Icicle) doesn't use `pcntl_fork()` at all. It uses non-blocking I/O to schedule asynchronous tasks. It leverages `yield` to create interruptible functions. The OP could use it to crawl many sites at once, since most of his script's runtime is consumed by blocking on network operations. – Trowski Apr 24 '15 at 15:50
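
For comparison, a minimal sketch of the pcntl_fork approach from the first comment (CLI only, requires the pcntl extension; the URL list is made up):

<?php
$urls = ["http://example.com/a", "http://example.com/b"]; // illustrative

foreach ($urls as $url) {
    $pid = pcntl_fork();
    if ($pid === -1) {
        die("fork failed");
    } elseif ($pid === 0) {
        // Child process: crawl one URL, then exit.
        $doc = new DOMDocument();
        @$doc->loadHTMLFile($url);
        // ... parse with DOMXpath as in the question ...
        exit(0);
    }
    // Parent: continue the loop and fork the next child.
}

// Parent waits for every child to finish.
while (pcntl_waitpid(-1, $status) > 0);

Each child gets its own PHP process, so the downloads overlap; the cost is one full process per URL, which is heavier than the non-blocking I/O that Icicle uses.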

If you are not planning to use any ready-made module, the way you did it is good; just make sure to parse each URL only once. Here is an example from an older post: How do I make a simple crawler in PHP?

If you decide to try ready-made modules, refer to http://phpcrawl.cuab.de/; it is a very good option.
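
A minimal sketch of the "parse each URL only once" idea, reusing the question's DOMDocument approach and adding the pause that halfer recommends in the question comments (a real crawler would also resolve relative links and stay on the target site):

<?php
$queue   = ["http://resultater.dai-sport.dk/tms/Turneringer-og-resultater/Pulje-Stilling.aspx?PuljeId=229"];
$visited = [];

while ($queue) {
    $url = array_shift($queue);
    if (isset($visited[$url])) {
        continue;                 // already crawled: never fetch a URL twice
    }
    $visited[$url] = true;

    $doc = new DOMDocument();
    @$doc->loadHTMLFile($url);    // @ silences warnings caused by sloppy HTML
    $xpath = new DOMXpath($doc);
    // ... extract the match results here ...

    // Queue further links, but only ones we have not seen yet.
    foreach ($xpath->query("//a/@href") as $href) {
        if (!isset($visited[$href->nodeValue])) {
            $queue[] = $href->nodeValue;
        }
    }

    sleep(2); // be polite: pause between requests so you don't get IP-blocked
}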
