
I have a database of content with free text in it. There are about 11,000 rows of data, and each row has 87 columns, so there are (potentially) around 957,000 fields to check for valid URLs.

I wrote a regular expression to extract everything that looks like a URL (http/s, etc.) and built up an array called $urls. I then loop through it, passing each $url to my curl_exec() call.
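The extraction step looks roughly like this (a simplified sketch, not my exact code; $rows stands in for the result set with all 87 column values per row):

$urls = array();
foreach ($rows as $row) {               // $row is an array of the 87 column values
    foreach ($row as $field) {
        // grab anything that looks like an http/https URL
        if (preg_match_all('~https?://[^\s"\'<>]+~i', (string)$field, $matches)) {
            foreach ($matches[0] as $match) {
                $urls[] = $match;
            }
        }
    }
}
$urls = array_unique($urls); // no point checking the same URL twice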

I have tried cURL (for each $url):

$ch = curl_init();
curl_setopt($ch, CURLOPT_CONNECTTIMEOUT_MS, 250);
curl_setopt($ch, CURLOPT_NOBODY, 1);
curl_setopt($ch, CURLOPT_FAILONERROR, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_CONNECT_ONLY, 1);
curl_setopt($ch, CURLOPT_HEADER, 1);
curl_setopt($ch, CURLOPT_HTTPGET, 1);
foreach ($urls as $url) {
    curl_setopt($ch, CURLOPT_URL, $url);
    $exec = curl_exec($ch);
    // Extra stuff here... it does add overhead, but not that much.
}
curl_close($ch);

As far as I can tell, this SHOULD work and be as fast as I can go, but it takes around 2-3 seconds per URL.

There has to be a faster way?

I am planning on running this via a cron job, checking my local database first to see whether each URL has already been checked in the last 30 days, and only re-checking it if not, so over time the workload will shrink. But I just want to know whether cURL is the best solution, and whether I am missing something to make it faster?
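The 30-day lookup I have in mind is roughly this (a sketch only, assuming MySQL, a hypothetical url_checks table with url and last_checked columns, and an existing $pdo connection):

$stmt = $pdo->prepare(
    'SELECT 1 FROM url_checks
     WHERE url = :url AND last_checked > DATE_SUB(NOW(), INTERVAL 30 DAY)'
);
$toCheck = array();
foreach ($urls as $url) {
    $stmt->execute(array(':url' => $url));
    if ($stmt->fetchColumn() === false) {
        $toCheck[] = $url; // not verified in the last 30 days, so check it again
    }
}
// only $toCheck is passed on to the actual URL checking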

EDIT: Based on the comment by Nick Zulu below, I now have this code:

function ODB_check_url_array($urls, $debug = true) {
  if (!empty($urls)) {
    $mh = curl_multi_init();
    foreach ($urls as $index => $url) {
      $ch[$index] = curl_init($url);
      curl_setopt($ch[$index], CURLOPT_CONNECTTIMEOUT_MS, 10000);
      curl_setopt($ch[$index], CURLOPT_NOBODY, 1);
      curl_setopt($ch[$index], CURLOPT_FAILONERROR, 1);
      curl_setopt($ch[$index], CURLOPT_RETURNTRANSFER, 1);
      curl_setopt($ch[$index], CURLOPT_CONNECT_ONLY, 1);
      curl_setopt($ch[$index], CURLOPT_HEADER, 1);
      curl_setopt($ch[$index], CURLOPT_HTTPGET, 1);
      curl_multi_add_handle($mh, $ch[$index]);
    }
    $running = null;
    do {
      curl_multi_exec($mh, $running);
    } while ($running);
    foreach ($ch as $index => $response) {
      $return[$urls[$index]] = curl_multi_getcontent($ch[$index]); // key by URL; a curl handle cannot be used as an array key
      curl_multi_remove_handle($mh, $ch[$index]);
      curl_close($ch[$index]);
    }
    curl_multi_close($mh);
    return $return;
  }
}
  • Possible duplicate of [Ping site and return result in PHP](https://stackoverflow.com/questions/1239068/ping-site-and-return-result-in-php) – Martijn Jan 24 '19 at 13:04
  • Why not use curl_multi_init which allows the processing of multiple cURL handles asynchronously? plz check http://php.net/manual/en/function.curl-multi-init.php – Nick Zulu Jan 24 '19 at 13:05
  • Hi @Martijn, I have seen many of these, and my question is whether I can make what I have faster. There is not much difference in what your reference post's implementation does and what mine does. My question though is if I am doing this as fast as possible, and if not, what can I adjust to make it faster? It is an insane lot of URLs to go through, so can't take 2-3 seconds per URL. – Kobus Myburgh Jan 24 '19 at 13:13
  • Hi @NickZulu, I am processing a large amount of URLs as per my question. Will this work to initiate hundreds of URLs? Will this different implementation have significant impact on the server making the requests? – Kobus Myburgh Jan 24 '19 at 13:14
  • 1
    @KobusMyburgh yes it can handle many requests. You can try it at a local server, i.e. your pc to see how much load it actually uses. I have tried that on a production server and had no issues – Nick Zulu Jan 24 '19 at 13:29
  • @NickZulu, is it realistic that all the `curl_multi_getcontent($ch[$index])` have blank results? I am following the example on the page you sent here: http://php.net/manual/en/function.curl-multi-init.php#118142 – Kobus Myburgh Jan 24 '19 at 14:23
  • @KobusMyburgh maybe this can help https://stackoverflow.com/questions/18796693/php-curl-multi-getcontent-returns-null – Nick Zulu Jan 24 '19 at 15:11
  • 1
    your current OOB function doesn't scale, it creates a separate curl handle for every url, you'll run out of resources (resource limit or connection limit or available memory) that way. – hanshenrik Jan 24 '19 at 16:03

2 Answers


let's see..

  • use the curl_multi api (it's the only sane choice for doing this in PHP)

  • have a max simultaneous connection limit, don't just create a connection for each url (you'll get out-of-memory or out-of-resource errors if you just create a million simultaneous connections. and i wouldn't even trust the timeout errors if you just created a million connections simultaneously)

  • only fetch the headers, because downloading the body would be a waste of time and bandwidth

here is my attempt:

// if return_fault_reason is false, then the return is a simple array of strings of urls that validated.
// otherwise it's an array with the url as the key containing array(bool validated, int curl_error_code, string reason) for every url
function validate_urls(array $urls, int $max_connections, int $timeout_ms = 10000, bool $consider_http_300_redirect_as_error = true, bool $return_fault_reason = false): array
{
    if ($max_connections < 1) {
        throw new InvalidArgumentException("max_connections MUST be >=1");
    }
    foreach ($urls as $key => $foo) {
        if (!is_string($foo)) {
            throw new \InvalidArgumentException("all urls must be strings!");
        }
        if (empty($foo)) {
            unset($urls[$key]); //?
        }
    }
    unset($foo);
    $urls = array_unique($urls); // remove duplicates.
    $ret = array();
    $mh = curl_multi_init();
    $workers = array();
    $work = function () use (&$ret, &$workers, &$mh, &$return_fault_reason) {
        // > If an added handle fails very quickly, it may never be counted as a running_handle
        while (1) {
            curl_multi_exec($mh, $still_running);
            if ($still_running < count($workers)) {
                break;
            }
            $cms=curl_multi_select($mh, 10);
            //var_dump('sr: ' . $still_running . " c: " . count($workers)." cms: ".$cms);
        }
        while (false !== ($info = curl_multi_info_read($mh))) {
            //echo "NOT FALSE!";
            //var_dump($info);
            {
                if ($info['msg'] !== CURLMSG_DONE) {
                    continue;
                }
                if ($info['result'] !== CURLM_OK) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, $info['result'], "curl_exec error " . $info['result'] . ": " . curl_strerror($info['result']));
                    }
                } elseif (CURLE_OK !== ($err = curl_errno($info['handle']))) {
                    if ($return_fault_reason) {
                        $ret[$workers[(int)$info['handle']]] = array(false, $err, "curl error " . $err . ": " . curl_strerror($err));
                    }
                } else {
                    $code = (string)curl_getinfo($info['handle'], CURLINFO_HTTP_CODE);
                    if ($code[0] === "3") {
                        if ($consider_http_300_redirect_as_error) {
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " redirect, which is considered an error");
                            }
                        } else {
                            if ($return_fault_reason) {
                                $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " redirect, which is considered a success");
                            } else {
                                $ret[] = $workers[(int)$info['handle']];
                            }
                        }
                    } elseif ($code[0] === "2") {
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(true, 0, "got a http " . $code . " code, which is considered a success");
                        } else {
                            $ret[] = $workers[(int)$info['handle']];
                        }
                    } else {
                        // all non-2xx and non-3xx are always considered errors (500 internal server error, 400 client error, 404 not found, etcetc)
                        if ($return_fault_reason) {
                            $ret[$workers[(int)$info['handle']]] = array(false, -1, "got a http " . $code . " code, which is considered an error");
                        }
                    }
                }
                curl_multi_remove_handle($mh, $info['handle']);
                assert(isset($workers[(int)$info['handle']]));
                unset($workers[(int)$info['handle']]);
                curl_close($info['handle']);
            }
        }
        //echo "NO MORE INFO!";
    };
    foreach ($urls as $url) {
        while (count($workers) >= $max_connections) {
            //echo "TOO MANY WORKERS!\n";
            $work();
        }
        $neww = curl_init($url);
        if (!$neww) {
            trigger_error("curl_init() failed! probably means that max_connections is too high and you ran out of resources", E_USER_WARNING);
            if ($return_fault_reason) {
                $ret[$url] = array(false, -1, "curl_init() failed");
            }
            continue;
        }
        $workers[(int)$neww] = $url;
        curl_setopt_array($neww, array(
            CURLOPT_NOBODY => 1,
            CURLOPT_SSL_VERIFYHOST => 0,
            CURLOPT_SSL_VERIFYPEER => 0,
            CURLOPT_TIMEOUT_MS => $timeout_ms
        ));
        curl_multi_add_handle($mh, $neww);
        //curl_multi_exec($mh, $unused_here); LIKELY TO BE MUCH SLOWER IF DONE IN THIS LOOP: TOO MANY SYSCALLS
    }
    while (count($workers) > 0) {
        //echo "WAITING FOR WORKERS TO BECOME 0!";
        //var_dump(count($workers));
        $work();
    }
    curl_multi_close($mh);
    return $ret;
}

here is some test code

$urls = [
    'www.example.org',
    'www.google.com',
    'https://www.google.com',
];
var_dump(validate_urls($urls, 1000, 1, true, false));

returns

array(0) {
}

because they all timed out (1 millisecond timeout) and fail-reason reporting was disabled (that's the last argument).

$urls = [
    'www.example.org',
    'www.google.com',
    'https://www.google.com',
];
var_dump(validate_urls($urls, 1000, 1, true, true));

returns

array(3) {
  ["www.example.org"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
  ["www.google.com"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
  ["https://www.google.com"]=>
  array(3) {
    [0]=>
    bool(false)
    [1]=>
    int(28)
    [2]=>
    string(39) "curl_exec error 28: Timeout was reached"
  }
}

increasing the timeout limit to 1000 ms, we get

var_dump(validate_urls($urls, 1000, 1000, true, false));

=

array(3) {
  [0]=>
  string(14) "www.google.com"
  [1]=>
  string(22) "https://www.google.com"
  [2]=>
  string(15) "www.example.org"
}

and

var_dump(validate_urls($urls, 1000, 1000, true, true));

=

array(3) {
  ["www.google.com"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
  ["www.example.org"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
  ["https://www.google.com"]=>
  array(3) {
    [0]=>
    bool(true)
    [1]=>
    int(0)
    [2]=>
    string(50) "got a http 200 code, which is considered a success"
  }
}

and so on :) The speed should depend on your bandwidth and on the $max_connections variable, which is configurable.

  • Thank you, @hanshenrik. This solved the problem for me. Very comprehensive answer. – Kobus Myburgh Jan 30 '19 at 13:08
  • 1
    @KobusMyburgh you should refresh, i made a small but significant performance improvement to the code, it should run much faster on large lists now – hanshenrik Feb 15 '19 at 22:30
  • Can I return the URLs only, that's without array? If so, how? –  Apr 15 '20 at 06:55
  • @ajeshkdy yeah just set the `$return_fault_reason` argument to false – hanshenrik Apr 15 '20 at 10:37
  • @hanshenrik Thanks for the fast reply. Actually I want to remove the array from the result and echo only the URLs if it "is considered a success". That is, the results would be like "www.google.com", "https://www.google.com", "www.example.org" if it's a success, else echo nothing. Sorry for not mentioning this earlier. –  Apr 15 '20 at 16:45
  • @ajeshkdy guess you're new to php? welcome ^^ anyway, it's var_dump() that is adding the "array" part, try: ```foreach($result as $url){ echo '"'.$url.'", '; }``` – hanshenrik Apr 15 '20 at 17:20

This is the fastest I could get it working quickly, using a tiny ping:

$domains = ['google.nl', 'blablaasdasdasd.nl', 'bing.com'];
foreach ($domains as $domain) {
    // escapeshellarg() guards against shell injection from the domain string
    $exists = null !== shell_exec("ping " . escapeshellarg($domain) . " -c1 -s1 -t1");
    echo $domain . ' ' . ($exists ? 'exists' : 'gone');
    echo '<br />' . PHP_EOL;
}

-c -> count (1 is enough)
-s -> size (1 is all we need)
-t -> timeout: how long to wait when there is no response. You might want to tweak this one.

Please keep in mind that some servers don't respond to ping. I don't know what percentage do, but I suggest implementing a better second check for everything that fails the ping check; that should be a significantly smaller set.
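A rough sketch of that two-phase idea (the cURL fallback below is just a minimal header-only request as a placeholder for whatever heavier check you prefer):

$needsHeavyCheck = array();
foreach ($domains as $domain) {
    $alive = null !== shell_exec("ping " . escapeshellarg($domain) . " -c1 -s1 -t1");
    if (!$alive) {
        $needsHeavyCheck[] = $domain; // no ping response, re-check over HTTP
    }
}
foreach ($needsHeavyCheck as $domain) {
    $ch = curl_init('http://' . $domain);
    curl_setopt_array($ch, array(
        CURLOPT_NOBODY         => true, // headers only, skip the body
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_TIMEOUT        => 5,
    ));
    $ok = curl_exec($ch) !== false && curl_getinfo($ch, CURLINFO_HTTP_CODE) < 400;
    curl_close($ch);
    echo $domain . ' ' . ($ok ? 'reachable over http' : 'gone') . '<br />' . PHP_EOL;
}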

  • Thank you - I was just about to ask about the servers that do not allow ping. In that case I will use cURL. A lot of those seem to be errors with SSL certificates as well. – Kobus Myburgh Jan 24 '19 at 13:20
  • Maybe this isn't a complete solution, but it should sift through the bulk fairly quickly, narrowing the set that requires heavier checks. – Martijn Jan 24 '19 at 13:22
  • 1
    on the other hand, this won't catch http 404 not found, http 405 gone, http 500 internal server errors, etc – hanshenrik Jan 24 '19 at 18:31
  • 1
    Nope. But if I have to check a million domains, I prefer to check them ASAP for at least connectivity. And the remaining list can get checked with heavier tests. – Martijn Jan 24 '19 at 19:13
  • not all domains have ping enabled – Radek Feb 04 '22 at 08:18