I have a script that fetches pages every day, and I want it to fetch a page only if its content has changed, so that the script runs faster and uses less traffic.

My idea is to fetch the headers first and compare Content-Length, downloading the whole document only if the value differs. This is not very precise, though, because a website can have dynamic parts that make Content-Length different on every request.

Is there another way, like using some sort of DNS or anything else?

Shubhamoy
Kref

3 Answers

I looked for an answer for more than two days, and nobody could give me a universal one.

So I implemented the ETag and If-Modified-Since headers (as Matt Raines and sowa suggest in their posts here), and to reduce traffic further I also used compression such as gzip.

There is also the Range request header, which someone told me I could use to grab only part of the page, but I think it is used only for files, not web pages.

Thank you all for your time.
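For anyone who wants to try the same thing, here is a minimal sketch of a conditional request with cURL. It assumes the cURL extension is installed and that you stored the ETag and Last-Modified values from the previous run yourself; the function names `buildConditionalHeaders` and `fetchIfChanged` are made up for illustration and are not part of any library.

```php
<?php
// Build conditional request headers from the validators saved on the previous run.
// Pass null for a validator the server never sent.
function buildConditionalHeaders(?string $etag, ?int $lastModified): array
{
    $headers = [];
    if ($etag !== null) {
        $headers[] = 'If-None-Match: ' . $etag;
    }
    if ($lastModified !== null) {
        // HTTP dates are always expressed in GMT.
        $headers[] = 'If-Modified-Since: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT';
    }
    return $headers;
}

// Returns null on 304 Not Modified, otherwise the body and the new validators.
function fetchIfChanged(string $url, ?string $etag, ?int $lastModified): ?array
{
    $responseHeaders = [];
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_HTTPHEADER, buildConditionalHeaders($etag, $lastModified));
    // An empty string asks for every compression format this cURL build supports.
    curl_setopt($ch, CURLOPT_ENCODING, '');
    // Collect response headers so the validators can be saved for the next run.
    curl_setopt($ch, CURLOPT_HEADERFUNCTION, function ($ch, $line) use (&$responseHeaders) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $responseHeaders[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
        return strlen($line);
    });

    $body = curl_exec($ch);
    $error = curl_error($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    curl_close($ch);

    if ($body === false) {
        throw new RuntimeException('Request failed: ' . $error);
    }
    if ($status === 304) {
        return null; // unchanged since the last run
    }

    $modified = isset($responseHeaders['last-modified'])
        ? strtotime($responseHeaders['last-modified'])
        : false;

    return [
        'body' => $body,
        'etag' => $responseHeaders['etag'] ?? null,
        'last_modified' => $modified !== false ? $modified : null,
    ];
}
```

On the first run you would call `fetchIfChanged($url, null, null)` and store the returned `etag` and `last_modified`; on later runs a `null` result means the page can be skipped. This only helps when the server actually sends one of the two validators.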
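A rough sketch of what a Range request looks like with cURL, in case someone wants to test it against their own pages. As far as I know, a server that supports ranges for the resource answers 206 Partial Content, while one that does not simply ignores the header and sends the whole page with 200. The helper names `byteRange` and `fetchFirstBytes` are invented for this example.

```php
<?php
// Format a byte range for CURLOPT_RANGE, e.g. the first 1024 bytes become "0-1023".
function byteRange(int $start, int $length): string
{
    if ($start < 0 || $length < 1) {
        throw new InvalidArgumentException('start must be >= 0 and length >= 1');
    }
    return $start . '-' . ($start + $length - 1);
}

// Ask for only the first $length bytes of $url and report whether the server honoured it.
function fetchFirstBytes(string $url, int $length): array
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_RANGE, byteRange(0, $length));
    $body = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_RESPONSE_CODE);
    curl_close($ch);

    return [
        // 206 means only the requested part was sent; 200 means the full page came back.
        'partial' => $status === 206,
        'body' => $body === false ? '' : $body,
    ];
}
```

If `partial` comes back false for your target pages, Range saves no traffic there.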

Kref

Update local file with remote, iff remote is newer

Cut-and-paste answer for those who want to check whether a remote file is more up to date than a local one, and update the local file if so:

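    // $remotePath = 'http://blahblah.com/file.ext';
    // $localPath = '/usr/whatever/app/file.ext';

    // Note: get_headers() sends a GET request unless the default stream context sets the method to HEAD.
    $headers = get_headers( $remotePath, true );
    // Header names are case-insensitive, so normalise them before the lookup.
    $headers = $headers ? array_change_key_case( $headers, CASE_LOWER ) : [];
    $last_modified = $headers['last-modified'] ?? null;
    if ( is_array( $last_modified ) ) {
        // After a redirect there is one value per response; the last one belongs to the final URL.
        $last_modified = end( $last_modified );
    }
    // Both values are false when unknown (no usable header, or no local file yet).
    $remote_mod_date = $last_modified ? strtotime( $last_modified ) : false;
    $local_mod_date = file_exists( $localPath ) ? filemtime( $localPath ) : false;

    if ( $remote_mod_date !== false && $local_mod_date !== false && $local_mod_date >= $remote_mod_date ) {
        // Local version up to date
    } else {
        // Remote file is newer, or its age could not be determined
        $ch = curl_init();

        curl_setopt($ch, CURLOPT_URL, $remotePath);
        // other options here, eg: curl_setopt($ch, CURLOPT_SSLVERSION, CURL_SSLVERSION_TLSv1_2);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        // Treat HTTP errors (status >= 400) as failures so an error page never overwrites the local file
        curl_setopt($ch, CURLOPT_FAILONERROR, true);

        $result = curl_exec($ch);

        if (curl_errno($ch)) {
            // handle error : curl_error($ch)
        }

        curl_close($ch);

        if ( $result ) {
            // Update local file with remote file contents
            file_put_contents( $localPath, $result );
        }
    }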

With thanks to the OP's question here, and also to this answer.
Created to solve automatic OIDC CA cert renewal (this, and this).

kris

Does curl_setopt($curl, CURLOPT_HTTPHEADER, ["If-Modified-Since: Sat, 30 Apr 2016 21:00:00 GMT"]); work? I get a 304 Not Modified response on a resource that was last modified earlier in the month.

Matt Raines
  • This will work only on static HTML pages. If the page is dynamic (PHP, Perl, Python, etc.), the server will not add a Last-Modified response header automatically, so it will not return a 304 code – Kref Apr 30 '16 at 21:39
  • No, fair enough. Most of my PHP pages return Last-Modified headers but I appreciate this isn't necessarily the case. But, if I understand the problem correctly, is it "how can I identify pages that haven't changed, which don't report Last-Modified or ETag, apart from the bits of the pages which **have** changed?" Because that seems quite ... a challenge ;) – Matt Raines May 01 '16 at 07:13
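
For pages that send neither Last-Modified nor ETag, one possible workaround (not something proposed in the thread, and it still downloads the page, so it saves processing rather than traffic) is to strip the parts you know are dynamic and compare a hash of what remains with the hash stored on the previous run. The sketch below assumes you can describe those dynamic parts with regular expressions; `contentFingerprint` and the example pattern are hypothetical.

```php
<?php
// Hash a page after removing the fragments matched by $ignorePatterns
// (PCRE patterns describing parts that change on every request).
function contentFingerprint(string $html, array $ignorePatterns): string
{
    $stable = preg_replace($ignorePatterns, '', $html);
    if ($stable === null) {
        throw new RuntimeException('Invalid pattern: ' . preg_last_error_msg());
    }
    return hash('sha256', $stable);
}

// Example: ignore a clock widget that changes on every request.
$patterns = ['/<span class="time">.*?<\/span>/s'];
$previous = contentFingerprint('<p>Hello</p><span class="time">10:00</span>', $patterns);
$current  = contentFingerprint('<p>Hello</p><span class="time">10:05</span>', $patterns);

// Identical fingerprints mean only the ignored parts changed.
$changed = $previous !== $current;
```

`preg_last_error_msg()` needs PHP 8; on older versions `preg_last_error()` returns the numeric code instead.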