
I am trying to crawl every page on my site (run by a cron job) to update data. There are roughly 500 pages.

I have tried two options:

  • PHP Simple HTML DOM Parser
  • PHP get_headers

Using either of the above, each page takes roughly 1.402 seconds to load. In total this takes about 570 seconds.

Is there a more efficient way of doing this?

danyo
  • Are you going through the web server with the requests, or the filesystem? – Luke Jun 23 '16 at 10:49
  • I am going through the web server. – danyo Jun 23 '16 at 10:51
  • If possible, try loading the files through the filesystem and see if that gives you any speed gains. The HTTP requests and the web server (Apache, presumably) are probably slowing things down. – Luke Jun 23 '16 at 10:52

1 Answer


Request pages in parallel (i.e. concurrently). Then the total time is no longer the sum of the individual request times, because many requests fire at once.

There are many ways to achieve this, but here is one example:

curl www.website.com/page1 &
curl www.website.com/page2 &
curl www.website.com/page3 &

Use xargs or another tool to limit concurrency so you don't flood the server with too many simultaneous connections (see, for example, Bash script processing commands in parallel).
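
As a minimal sketch, assuming the ~500 page URLs are listed one per line in a file called urls.txt (that file name is an assumption on my part):

# run at most 10 curl processes at a time, one URL per process
# -s keeps curl quiet, -o /dev/null discards the response body
xargs -P 10 -n 1 curl -s -o /dev/null < urls.txt

Adjust -P to whatever level of concurrency your server handles comfortably.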

Running requests in parallel inside a single PHP script can be complicated; it is easier to use the command line, if possible.
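
Since the crawl is already kicked off by cron, the whole job could live in the crontab instead of a PHP script. A hypothetical entry (the schedule, the paths, and the urls.txt file are assumptions) might look like:

# every hour, fetch all URLs listed in urls.txt, 10 at a time
0 * * * * /usr/bin/xargs -P 10 -n 1 curl -s -o /dev/null < /path/to/urls.txt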

pythonjsgeo