
I've written a PHP script that parses an RSS feed and tries to get the Open Graph images from the og:image meta tags.

To get the images I need to check whether the URLs in the RSS feed are 301 redirects. This happens often, and it means I have to follow each redirect to the resulting URL, which makes the script run really slowly. Is there a quicker, more efficient way of achieving this?

Here is the function that follows any redirects and fetches the contents of the final URL:

function curl_get_contents($url) {
    $agent = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.0.3705; .NET CLR 1.1.4322)';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);
    curl_setopt($ch, CURLOPT_VERBOSE, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_USERAGENT, $agent);
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow 301/302 redirects automatically
    $result = curl_exec($ch);
    curl_close($ch); // free the handle
    return $result;
}

And this is the function to retrieve the og images (if they exist):

function getog($url) {
    $html = curl_get_contents($url);
    if ($html == "") {return;} // nothing fetched, nothing to parse
    $doc = new DomDocument();
    libxml_use_internal_errors(true); // suppress warnings from real-world HTML
    $doc->loadHTML($html);
    $xpath = new DOMXPath($doc);
    $query = '//*/meta[starts-with(@property, \'og:\')]';
    $metas = $xpath->query($query);
    $ogProperty = array('url' => '', 'title' => '', 'image' => '');
    foreach ($metas as $meta) {
        $property = $meta->getAttribute('property');
        $content = $meta->getAttribute('content');
        if ($property == "og:url"   && $ogProperty['url'] == "")   {$ogProperty['url'] = $content;}
        if ($property == "og:title" && $ogProperty['title'] == "") {$ogProperty['title'] = $content;}
        if ($property == "og:image" && $ogProperty['image'] == "") {$ogProperty['image'] = $content;}
    }
    return $ogProperty;
}

There is quite a bit more to the script, but these functions are the bottleneck. I'm also caching to a text file, which means it's faster after the first run.

How can I speed up my script to retrieve the final url and get the image urls from the links in the RSS feed?

iagdotme
    There is no way to speed up following redirects. The client has to make a new request, and that takes the time it takes. With `CURLOPT_FOLLOWLOCATION` cURL does this automatically already, so there is no point where you could possibly interject to make anything faster. – CBroe Feb 09 '15 at 13:12

3 Answers


You can use Facebook's OG API. Facebook uses it to scrape important info from any URL, and it is much faster than the usual scraping method.

You can use it like this:

og_scrapping.php:

    function meta_scrap($url) {
        $link = 'https://graph.facebook.com/?id=' . $url . '&scrape=true&method=post';
        $response = curl_get_contents($link); // reuses the cURL helper from the question
        return json_decode($response);
    }

Then simply call it anywhere after including og_scrapping.php:

    print_r(meta_scrap('http://www.example.com'));

You will get an object back, from which you can pick out whatever fields you need.

For the title, image, URL and description, with $output = meta_scrap('http://www.example.com'), you can get them by:

$title = $output->title;
$image = $output->image[0]->url;
$description = $output->description;
$url = $output->url;  

The major issue occurs when scraping for images; getting the title and description is easy. Read this article to get images in a faster way. It will also help you save a few seconds.

Moid

I'm afraid there isn't much you can do to speed up the extraction process itself. One possible improvement would be approaching the image extraction string-wise, that is, focusing on the og: tags with a regex (something usually strongly advised against).

This has the major downside of breaking easily if the source markup ever changes, while offering no significant speed advantage over the more stable DOM parsing approach.
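For illustration only, here is a minimal sketch of what that regex approach could look like (the helper name and pattern are my own, and it inherits all the fragility just described):

    // Hypothetical helper: pull og:* properties out of raw HTML with a regex.
    // Breaks if attribute order or quoting differs from the pattern below.
    function getog_regex($html) {
        $og = array();
        $pattern = '/<meta[^>]*property=["\']og:(\w+)["\'][^>]*content=["\']([^"\']*)["\']/i';
        if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
            foreach ($matches as $m) {
                $og[$m[1]] = $m[2]; // e.g. $og['image'], $og['title'], $og['url']
            }
        }
        return $og;
    }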


I'm also caching to a text file, which means it's faster after the first run.

On the other hand, you might go with an approach that always serves only the cache to the user and, on each request, renews it with an asynchronous call if needed.

As CBroe commented on your question:

There is no way to speed up following redirects. The client has to make a new request, and that takes the time it takes. With CURLOPT_FOLLOWLOCATION cURL does this automatically already, so there is no point where you could possibly interject to make anything faster.

This means it is not a heavy task for your webserver, but a lengthy one, because of the numerous requests it has to perform. This is very good ground to start thinking asynchronously:

  1. you receive a request that is looking for the RSS items,
  2. you serve a response very quickly from the cache,
  3. you send an asynchronous request to rebuild the cache if needed. This is the longest part, due to the redirects and DOM parsing, but the original client requesting the list of RSS items does not have to wait for it to complete; sending the rebuild request itself only takes a few microseconds (see the sketch after the link below),
  4. you return with the cached items.

Asynchronous shell exec in PHP
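A minimal sketch of that fire-and-forget step, assuming a hypothetical rebuild_cache.php script that refetches the feed, follows the redirects, and rewrites the text-file cache:

    // serve_feed.php -- always answer from the cache; rebuild in the background.
    $cacheFile = 'feed_cache.txt'; // assumption: the text-file cache from the question
    $maxAge    = 300;              // rebuild when the cache is older than 5 minutes

    if (!file_exists($cacheFile) || time() - filemtime($cacheFile) > $maxAge) {
        // Fire and forget: the client never waits for the redirects/DOM parsing.
        exec('nohup php rebuild_cache.php > /dev/null 2>&1 &');
    }

    if (file_exists($cacheFile)) {
        echo file_get_contents($cacheFile); // serve whatever is cached right now
    }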

If you'd go down this route, in your case, you'd meet the following advantages:

  • rapid content serving with high loading speed,
  • no loading speed reduction when the cache is being rebuilt.

But also, the following disadvantages:

  • the first user to request an updated feed does not immediately* receive the newest item(s),
  • subsequent users after the first one do not immediately* receive the newest item(s) until the cache is ready.

*Good news is, you can almost perfectly eliminate all disadvantages using a cyclic, timed AJAX request that checks if there are any new items in the RSS items cache.

If there are, you can display a message on top (or on bottom), informing the user about the arrival of new content, and append that content when the user clicks on the notice.

This approach - compared to simply always serving cached content without the cyclic AJAX call - reduces the delay between live RSS appearance and item appearance on your site to a maximum time of n + m, where n is the AJAX-request interval, and m is the time it takes to rebuild the cache.
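The server side of that cyclic check can be very small. Here is a sketch, assuming the same feed_cache.txt file and a hypothetical check_feed.php endpoint that the page polls:

    // check_feed.php -- polled periodically by the page via AJAX.
    header('Content-Type: application/json');
    $mtime = file_exists('feed_cache.txt') ? filemtime('feed_cache.txt') : 0;
    $since = isset($_GET['since']) ? (int) $_GET['since'] : 0;
    echo json_encode(array(
        'updated' => $mtime > $since, // true when the cache was rebuilt since last check
        'mtime'   => $mtime,          // client stores this and sends it back as ?since=
    ));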

John Weisz

Meta tags are stored in the "head" element.

In your XPath query, you should therefore anchor on the head element:

$query = '//head/meta[starts-with(@property, \'og:\')]';

You lose time retrieving, storing, and parsing the whole HTML file when you could stop the retrieval at the end of the "head" element. Why download a 40 kB web page when you only need the first 1 kB?

You "might" consider stopping the retrieval after seing the ending "head" element. It can speed up the thing when there is no other thing to do, but it is a naughty-not-always-working-hack.

Adam