
I have a task to crawl 100,000 URLs, save the data in an Excel sheet, and store the images in separate directories. I have written a script using Simple HTML DOM and PHPExcel; it processes roughly 4 URLs per minute, but it gradually becomes slower and slower as time passes.

I want to make it faster. I am using an OOP approach and have divided the different child processes into small functions, which also helps me keep variables freed from memory. I am running on local resources with XAMPP on Windows.

Please let me know how to speed up execution so it can process more URLs in less time.

Thanks.

Mian Majid
    You'll need to share some of your code to get any real advice. – Geoherna Nov 21 '16 at 09:45
  • You should think about a way to enqueue all the sites to be processed and run your crawler in parallel, consuming the queue. – Tom Nov 21 '16 at 09:53
  • Please provide more information about your problem. If you have a list of 100,000 URLs to grab up front, you could use something else to grab the contents of each and then process them later. – Progrock Nov 21 '16 at 10:34

1 Answer


Your bottleneck will probably be network latency, since you run locally as you said. Your process waits for each response before it can move on to the next URL. To make the most of your local network connection, you can have multiple processes running at the same time.
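One way to overlap those network waits even within a single PHP process is the built-in curl_multi API, which runs several transfers concurrently. Below is a minimal sketch, not the asker's actual crawler: the $urls batch, the timeout value, and the parsing step are illustrative placeholders.

    <?php
    // Fetch a batch of URLs concurrently with curl_multi.
    $urls = [
        'http://example.com/page1',
        'http://example.com/page2',
        'http://example.com/page3',
    ];

    $mh = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // don't let one slow site stall the batch
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still active.
    do {
        $status = curl_multi_exec($mh, $active);
        if ($active) {
            curl_multi_select($mh); // wait for socket activity instead of busy-looping
        }
    } while ($active && $status === CURLM_OK);

    foreach ($handles as $url => $ch) {
        $html = curl_multi_getcontent($ch);
        // ... parse $html with Simple HTML DOM and save the results here ...
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }

    curl_multi_close($mh);

With batches of, say, 10-20 handles, the script spends its time parsing instead of waiting on one response at a time.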

I'm not sure if this is what you meant by 'child process', but you can do multithreading in PHP (see this question). Or just start your PHP script multiple times on the command line, each with a part of the workload.
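For the command-line variant, a hedged sketch: each invocation gets an offset and a count as arguments, so four terminals can each work through a quarter of the list. The file names and argument layout here are assumptions for illustration, not the asker's code.

    <?php
    // crawl_slice.php - process one slice of the URL list.
    // Run e.g.:  php crawl_slice.php 0 25000
    //            php crawl_slice.php 25000 25000
    $offset = isset($argv[1]) ? (int)$argv[1] : 0;
    $count  = isset($argv[2]) ? (int)$argv[2] : 1000;

    // Load only this process's slice of the URL list (urls.txt is assumed).
    $allUrls = file('urls.txt', FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    $slice   = array_slice($allUrls, $offset, $count);

    foreach ($slice as $url) {
        // ... fetch and process $url as the existing crawler already does ...
    }

Each process should also write to its own output file and image directory so the instances don't clobber each other; the results can be merged afterwards.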

If the network is still the bottleneck, you can consider paying for a server with better network performance. And of course, when the CPU becomes the bottleneck, you need a better server for better performance ;)

That said, don't expect it to be fast on just a single server.

Wouter de Winter