
I was searching Stack Overflow for a solution, but couldn't find anything even close to what I am trying to achieve. Perhaps I am just blissfully unaware of some magic PHP sauce everyone is using to tackle this problem... ;)

Basically I have an array with, give or take, a few hundred URLs pointing to different XML files on a remote server. I'm doing some magic file-checking to see if the content of the XML files has changed, and if it has, I download the newer XMLs to my server.

PHP code:

$urls = array(
    'http://stackoverflow.com/a-really-nice-file.xml',
    'http://stackoverflow.com/another-cool-file2.xml'
);
foreach($urls as $url){
    set_time_limit(0);
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_BINARYTRANSFER, false);
    $contents = curl_exec($ch);
    curl_close($ch);
    // $filename is set elsewhere and gives each XML its own ID (see below)
    file_put_contents($filename, $contents);
}

Now, $filename is set somewhere else and gives each XML its own ID based on my logic. So far this script runs OK and does what it should, but it does it terribly slowly. I know my server can handle a lot more, and I suspect my foreach is slowing down the process.

Is there any way I can speed up the foreach? Currently I am thinking of upping the number of downloads and file_put_contents calls per loop iteration to 10 or 20, basically cutting my execution time 10- or 20-fold, but I can't think of the best and most performant way to approach this. Any help or pointers on how to proceed?

hakre
David K.
  • I do something like this: http://stackoverflow.com/questions/6107339/parallel-processing-in-php-how-do-you-do-it. One script per CPU core seems to work best for me. –  Oct 05 '12 at 23:08
  • @Wesley Murch: Hmm, actually there are just a handful of different servers. The performance I get is actually pretty good in terms of download speed of the XMLs (which are only a few KB each). Basically I download the XMLs, store the newest ones on my server, read them out, and add their contents to my DB for fast indexing and searchability in my frontend application. – David K. Oct 05 '12 at 23:09
  • @Dagon: Interesting. I'll take a closer look at it, but that solution would mean a massive re-coding of my current logic. The above script is of course just a brief explanation of what my code does, albeit the code for downloading and storing is the same. I'd just hate it if I have to take apart my whole logic and run multiple scripts ... the cross-checking to see which script does what and which XMLs should go where would be immense ... or not? – David K. Oct 05 '12 at 23:13
  • @Wesley Murch: I trigger this script by cron every 48 hours, but would like to do so every 6-12 hours if I can overcome the performance issue ... the script runs for a flat 30 hours + another 1-3 hours to populate my database ... ;( – David K. Oct 05 '12 at 23:14
  • Should not be a "massive" change, just spawn the scripts, each with a link set or ID range to check. –  Oct 05 '12 at 23:16
  • You will want to limit how many you run in parallel; I find 1 per CPU core optimal, any more and they start to get slower as a total - but test your hardware and see. Assuming the machine is not doing anything else. –  Oct 05 '12 at 23:25
  • Can you just spawn wget in the shell? That sounds like it'd be faster. – aknosis Oct 05 '12 at 23:48
  • Well, how do I find out how many CPU cores I have, or rather, how do I assign a script to a specific CPU core? – David K. Oct 06 '12 at 11:08
  • What sort of hosting do you have? I would be surprised if a shared host allowed something like this to run. –  Oct 07 '12 at 03:22
  • I have my own instances on Amazon as well as a couple of VPSes that handle some larger databases. The script is run from my most performant VPS and pushes the data to a VPS configured to handle the MySQL database. The front end is a plain web host, but no scripts or databases run there; it just serves the output. – David K. Oct 08 '12 at 10:23

3 Answers


Your bottleneck (most likely) is your curl requests: you can only write to a file after each request is done, and there is no way (in a single script) to speed up that process.

I don't know how it all works, but you can execute curl requests in parallel: http://php.net/manual/en/function.curl-multi-exec.php.

Maybe you can fetch the data (if memory is available to store it) and then, as the requests complete, write the data out to the files.
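
For reference, a minimal sketch of what that could look like with curl_multi_* (this is not the asker's exact code; the batch size, the error handling and the target filenames are placeholders for their own ID logic):

$urls = array(
    'http://stackoverflow.com/a-really-nice-file.xml',
    'http://stackoverflow.com/another-cool-file2.xml'
);
$batchSize = 10; // how many requests to run in parallel per batch (tune for your server)

foreach (array_chunk($urls, $batchSize, true) as $batch) {
    $mh      = curl_multi_init();
    $handles = array();

    // Create one easy handle per URL, using the same options as the original loop.
    foreach ($batch as $key => $url) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_FAILONERROR, true);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_multi_add_handle($mh, $ch);
        $handles[$key] = $ch;
    }

    // Run all handles in this batch until every transfer has finished.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running) {
            curl_multi_select($mh); // wait for activity instead of busy-looping
        }
    } while ($running && $status === CURLM_OK);

    // Collect the responses and write them out.
    foreach ($handles as $key => $ch) {
        $contents = curl_multi_getcontent($ch);
        if ($contents) {
            // Placeholder filename; the asker's own $filename/ID logic goes here.
            file_put_contents('/path/to/xml/' . $key . '.xml', $contents);
        }
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }

    curl_multi_close($mh);
}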

aknosis
  • There is a nice tutorial about this: http://www.phpied.com/simultaneuos-http-requests-in-php-with-curl/ – Petr Oct 06 '12 at 10:20
  • I think I am going to go with this answer. Just waiting a little bit to accept the answer. Perhaps someone may just have an easier solution to my above code... – David K. Oct 06 '12 at 11:11

Just run more scripts. Each script will download some of the URLs.

You can get more information about this pattern here: http://en.wikipedia.org/wiki/Thread_pool_pattern

The more scripts you run, the more parallelism you get.
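
A rough sketch of that idea (my illustration, not the answerer's code): split the URL list into ranges and launch one background PHP worker per range, where worker.php is a hypothetical script that takes an offset and a count and runs the existing download loop on that slice.

$totalUrls = 400; // however many URLs are in the list
$workers   = 4;   // e.g. one per CPU core, as suggested in the comments above
$perWorker = (int) ceil($totalUrls / $workers);

for ($i = 0; $i < $workers; $i++) {
    $offset = $i * $perWorker;
    // Launch each worker in the background (Unix shell syntax).
    exec(sprintf('php worker.php %d %d > /dev/null 2>&1 &', $offset, $perWorker));
}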

dynamic
  • I'm afraid that is not the solution I was looking for. Running multiple scripts in parallel would kill my application logic (or at least I'd have to rewrite most of my application code to make that work) ... Didn't downvote, but just wanted to clarify ... – David K. Oct 05 '12 at 23:17
  • That's how parallelism in big engines works... (I would say almost all of them work this way): http://en.wikipedia.org/wiki/Thread_pool_pattern. Also, having little PHP scripts that you can launch in parallel makes their management much easier. – dynamic Oct 05 '12 at 23:33

For parallel requests I use a Guzzle pool ;) (you can send X parallel requests):

http://docs.guzzlephp.org/en/stable/quickstart.html
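
A minimal sketch based on the Guzzle quickstart (the concurrency value, the error handling and the target filenames are placeholders, not something from the original post):

require 'vendor/autoload.php';

use GuzzleHttp\Client;
use GuzzleHttp\Pool;
use GuzzleHttp\Psr7\Request;

$urls = array(
    'http://stackoverflow.com/a-really-nice-file.xml',
    'http://stackoverflow.com/another-cool-file2.xml'
);

$client = new Client();

// Generator that yields one GET request per URL.
$requests = function (array $urls) {
    foreach ($urls as $url) {
        yield new Request('GET', $url);
    }
};

$pool = new Pool($client, $requests($urls), array(
    'concurrency' => 10, // how many requests are in flight at once
    'fulfilled'   => function ($response, $index) {
        // Placeholder filename; plug in your own ID logic here.
        file_put_contents('/path/to/xml/' . $index . '.xml', (string) $response->getBody());
    },
    'rejected'    => function ($reason, $index) {
        // Log or retry failed downloads here.
    },
));

// Start the transfers and wait for them all to finish.
$pool->promise()->wait();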

Lukáš Kříž