
I have a web portal that needs to download lots of separate JSON files and display their contents in a sort of form view. By lots I mean 32 separate files, minimum.

I've tried cURL with brute-force iteration and it's taking ~12.5 seconds.

I've tried curl_multi_exec as demonstrated here http://www.php.net/manual/en/function.curl-multi-init.php with the function below and it's taking ~9 seconds. A little better, but still terribly slow.

function multiple_threads_request($nodes){
    $mh = curl_multi_init();
    $curl_array = array();

    // Create one easy handle per URL and attach it to the multi handle.
    foreach($nodes as $i => $url)
    {
        $curl_array[$i] = curl_init($url);
        curl_setopt($curl_array[$i], CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($mh, $curl_array[$i]);
    }

    // Drive all transfers until none are still running.
    $running = NULL;
    do {
        curl_multi_exec($mh, $running);
    } while($running > 0);

    // Collect the response bodies, keyed by URL.
    $res = array();
    foreach($nodes as $i => $url)
    {
        $res[$url] = curl_multi_getcontent($curl_array[$i]);
    }

    // Detach the easy handles and clean up the multi handle.
    foreach($nodes as $i => $url){
        curl_multi_remove_handle($mh, $curl_array[$i]);
    }
    curl_multi_close($mh);
    return $res;
}
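
For reference, the wait loop above can also be written so it blocks in curl_multi_select() instead of spinning on curl_multi_exec(); this is only a rough sketch reusing the same $mh and handles as above, not a benchmarked replacement:

// Sketch: same handles as above, but wait on curl_multi_select()
// instead of busy-looping on curl_multi_exec().
$running = NULL;
do {
    $status = curl_multi_exec($mh, $running);
} while ($status === CURLM_CALL_MULTI_PERFORM);

while ($running && $status === CURLM_OK) {
    // Block until at least one handle has activity, or 1 second passes.
    if (curl_multi_select($mh, 1.0) === -1) {
        usleep(100000); // select failed; back off briefly instead of spinning
    }
    do {
        $status = curl_multi_exec($mh, $running);
    } while ($status === CURLM_CALL_MULTI_PERFORM);
}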

I realize this is an inherently expensive operation, but does anyone know of any other alternatives that might be faster?

EDIT: In the end, my system was limiting curl_multi_exec, and moving the code to a production machine brought dramatic improvements.

Brad
  • Must it be done in the backend? Why don't you move the fetching to the client? Some AJAX magic, backbone views and you're basically done. In waaaay less than 9 seconds. ;-) – nietonfir Jan 06 '14 at 21:06
  • Yeah... I thought about that, but I'd have to refactor easily ~75% of my work, which I can't do in the short term. Longer term, that's definitely what I should do. – Brad Jan 06 '14 at 21:08
  • Is caching an option? Either explicitly, or implicitly. (It is largely orthogonal to moving fetches to the client side.) – user2864740 Jan 06 '14 at 21:09
  • Well, must the information provided by those sources be accurate? You could fetch those in a cron job (e.g. every minute or every hour, depending on your needs) and include the aggregated JSON in your serverside code (a rough sketch of this appears after the comments). – nietonfir Jan 06 '14 at 21:11
  • Does not sound like something I would do per user request; cache/store, retrieve periodically – Jan 06 '14 at 21:11
  • Try it with 32 times the same fast site, let's say google.com. Still 9 sec? – hek2mgl Jan 06 '14 at 21:12
  • @user2864740 Cannot be cached. Must be accurate to the moment in time. Again, I know it's inherently expensive... just trying to do the least crappy thing available – Brad Jan 06 '14 at 21:13
  • If you went from 12.5 to 9 seconds, does that mean your longest cURL request is 9 seconds and the other 31 are less than 9? Is there a cURL limit when using `curl_multi_init()`? – MonkeyZeus Jan 06 '14 at 21:13
  • @user2191572 `does that mean your longest cURL request is 9 seconds` yeah, this is likely – hek2mgl Jan 06 '14 at 21:14
  • @hek2mgl - tried that. slower. 14.3 sec – Brad Jan 06 '14 at 21:15
  • @user2191572 Good question. This is a blocking call for the page being built, so I'm not sure. I'm measuring with the Google developer tools Network tab... have a better suggestion there? – Brad Jan 06 '14 at 21:16
  • Have you tried to investigate using Wireshark? Sometimes it's slow DNS that causes that – hek2mgl Jan 06 '14 at 21:16
  • Well, I'd still cache it and fetch it periodically. 30 seconds should give you accurate enough data and enough time to fetch. – nietonfir Jan 06 '14 at 21:16
  • @hek2mgl Yeah, it sounds like it, so if that's true then the only reasonable answer is for OP to cURL sites with better bandwidth, or maybe OP's bandwidth isn't up to par. – MonkeyZeus Jan 06 '14 at 21:16
  • all the files are less than 100kb – Brad Jan 06 '14 at 21:18
  • @Brad, are you familiar with wireshark? – hek2mgl Jan 06 '14 at 21:18
  • @hek2mgl Seen and heard of it, never used it... I'll check it out – Brad Jan 06 '14 at 21:19
  • I would give it a try. It's pretty self-explanatory, and you need to see the network traffic under the hood in order to fix the problem. Another idea would be to run your script using `strace php multi.php` and post the (large) output to a pastebin – hek2mgl Jan 06 '14 at 21:21
  • @hek2mgl check out the comments under the benchmarking answer I provided – MonkeyZeus Jan 06 '14 at 21:27
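
A rough sketch of the cron-based pre-fetch nietonfir suggests above; the script name and cache path are hypothetical, and the fetch loop is the plain sequential cURL version:

// fetch_cache.php -- run from cron, e.g. every 30 seconds or every minute
// (file name and cache path are made up for illustration)
$urls = array(/* ... the 32 JSON endpoints ... */);

$aggregate = array();
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $aggregate[$url] = json_decode(curl_exec($ch), true);
    curl_close($ch);
}

// Write to a temp file and rename so page requests never read a half-written cache.
file_put_contents('/tmp/aggregate.json.tmp', json_encode($aggregate));
rename('/tmp/aggregate.json.tmp', '/tmp/aggregate.json');

// The page then just reads:
// $data = json_decode(file_get_contents('/tmp/aggregate.json'), true);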

1 Answer


You should definitely look into benchmarking your cURL requests to see which one has the slowdown. This was too lengthy for a comment, so let me know whether it helps or not:

// revert to "cURLing with brute force iteration" as you described it :)

$curl_timer = array();

foreach($curlsite as $row) // $curlsite: whatever list of URLs/sites you iterate over
{
    $start = microtime(true);

    /**
     * curl code
     */

    // time taken by this single request
    $curl_timer[] = (microtime(true)-$start);
}

echo '<pre>'.print_r($curl_timer, true).'</pre>';
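
If reverting to sequential requests just for the benchmark is inconvenient, roughly the same per-request timings can be read from the multi version with `curl_getinfo()` on each easy handle once the transfers finish. A sketch, assuming the `$curl_array` from the question and that it is read before the handles are removed:

// Per-handle timings from the curl_multi version, using the question's $curl_array.
// Collect these before curl_multi_remove_handle()/curl_multi_close().
$curl_timer = array();
foreach ($curl_array as $i => $ch) {
    $curl_timer[$i] = array(
        'total'   => curl_getinfo($ch, CURLINFO_TOTAL_TIME),      // whole transfer
        'dns'     => curl_getinfo($ch, CURLINFO_NAMELOOKUP_TIME), // DNS lookup
        'connect' => curl_getinfo($ch, CURLINFO_CONNECT_TIME),    // TCP connect
    );
}
echo '<pre>'.print_r($curl_timer, true).'</pre>';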
MonkeyZeus