4

I need to include an HTML page (generated by ASP.net) in a PHP page.

To do it I use:

echo file_get_contents("http://example.com");

But this way my server has to download the page every time my page is opened.

I'd like to add a cache system, but I need to refresh the cache every time the example.com content changes.
What is the best method (if there is one) to detect whether the content has changed without downloading the entire page each time?

Here are the HTTP headers of the remote page:

HTTP/1.1 200 OK => 
Cache-Control => no-cache
Pragma => no-cache
Content-Length => 63648
Content-Type => text/html; charset=utf-8
Expires => -1
Server => Microsoft-IIS/7.5
Set-Cookie => ASP.NET_SessionId=xxxxxxxxxxxxxxxx; path=/; HttpOnly
X-Powered-By => ASP.NET
X-AspNet-Version => 4.0.30319
X-UA-Compatible => chrome=1
X-CID => 2-18
Date => Thu, 12 Sep 2013 08:54:59 GMT
Connection => close

Another site gives me these:

Server Response HTTP/1.1 200 OK
HTTP/1.1 200 OK
Cache-Control: no-cache
Pragma: no-cache
Content-Length: 65367
Content-Type: text/html; charset=utf-8
Expires: -1
Server: Microsoft-IIS/7.5
Set-Cookie: ARRSID=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx;Path=/;Domain=.example.com
Set-Cookie: ASP.NET_SessionId=xxxxxxxxxxxxxxxxxxx; path=/; HttpOnly
X-Powered-By: UrlRewriter.NET 2.0.0
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
X-UA-Compatible: chrome=1
X-Powered-By: ARR/2.5
X-Powered-By: ASP.NET
X-CID: 1-18
Date: Thu, 12 Sep 2013 08:56:03 GMT
Fez Vrasta
  • 1
    It depends on your application and the page. Maybe you can serve it from the cache and refresh the cache every hour or every day, depending on how frequently that page changes. – trrrrrrm Sep 12 '13 at 08:32
  • Can't see a way of reading the page without downloading at least part of it. With existing files you can use `fread` to get a small part of it, but not sure that sort of 'fix' is available with `curl`. – MDEV Sep 12 '13 at 08:33
  • @ra_htial the page has a comments system and I don't know how often people write comments. – Fez Vrasta Sep 12 '13 at 08:34
  • 1
    for caching on your server see [this](http://stackoverflow.com/a/5263017/1273830). for knowing when the source changed, you'd have to find something unique about the server. Like does the server have `content-length` header? If so, you can know the page is refreshed if this value changed. But if the site is not totally in your control, or you have no way of knowing exactly when the page changed, you would want to refresh the file cached on your server every now and then, probably using a cron job. Edit: also check out if the server has Last-Modified header as Ruben said. – Prasanth Sep 12 '13 at 08:37
  • I guess there is no way to understand if a page was modified or not without fully downloading it. – Konstantin Sep 12 '13 at 08:38
  • 1
    @Konstantin that's why guessing isn't a good idea =) – AD7six Sep 12 '13 at 08:38
  • @AD7six please describe such a method ;-) – Konstantin Sep 12 '13 at 08:39
  • @FezVrasta you need to show the headers for `/page.aspx` - since that's where you get redirected (though, given it's a dynamic page for something you don't own/control - normal header-based techniques may not work). – AD7six Sep 12 '13 at 08:42
  • @FezVrasta Why? What's the purpose of this? Are you just trying to create a cached iframe alternative? – MDEV Sep 12 '13 at 08:42
  • try to send the `If-Modified-Since` header as `hexblot` mentioned below and post the result – trrrrrrm Sep 12 '13 at 08:43
  • @FezVrasta: the updated headers are redirect ( 302 ) headers, not the final ones. – Nick Andriopoulos Sep 12 '13 at 08:43
  • 1
    @Konstantin or you could read the existing answers and their references =) – AD7six Sep 12 '13 at 08:44
  • @AD7six all existing answers at the moment rely on specific http headers which may not be present in a response. – Konstantin Sep 12 '13 at 08:49
  • the page doesn't support the `If-Modified-Since` (tested with http://www.feedthebot.com/tools/if-modified/) – Fez Vrasta Sep 12 '13 at 08:50
  • @Konstantin true, but you don't know that yet since the headers for the requested page are not in the question (so while quite likely, that's another guess). – AD7six Sep 12 '13 at 08:52
  • now the headers should be right – Fez Vrasta Sep 12 '13 at 08:55

4 Answers

3

Assuming the remote server supports it, the best way is to use the HTTP headers of that page.

Specifically, send an If-Modified-Since request header with the time of your cached copy. If the page has not changed since then, a supporting server answers 304 Not Modified with no body, which is exactly what you need.

Additionally, you could check for the ETag response header, which provides an identifier for the content. Changes to the page should change the identifier (a common choice is the timestamp of page generation), and you can send the stored value back in an If-None-Match request header. Again, this depends on server configuration.
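As a rough illustration of a conditional request from PHP, here is a minimal sketch using cURL. The function names `buildConditionalHeaders` and `fetchIfChanged` are made up for this example, and it only helps if the remote server actually honours `If-Modified-Since` or `If-None-Match`; a server that ignores them will simply return 200 with the full page every time.

```php
<?php
// Hypothetical helper: turn the validators saved with the cached copy
// into conditional request headers.
function buildConditionalHeaders(?string $etag, ?string $lastModified): array
{
    $headers = [];
    if ($etag !== null && $etag !== '') {
        $headers[] = 'If-None-Match: ' . $etag;
    }
    if ($lastModified !== null && $lastModified !== '') {
        $headers[] = 'If-Modified-Since: ' . $lastModified;
    }
    return $headers;
}

// Hypothetical helper: returns null when the server answers 304 Not Modified
// (keep using the cache), otherwise the freshly downloaded body.
function fetchIfChanged(string $url, ?string $etag, ?string $lastModified): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,   // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,   // the page in the question redirects
        CURLOPT_HTTPHEADER     => buildConditionalHeaders($etag, $lastModified),
    ]);
    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);

    return $status === 304 ? null : $body;
}
```

A caller would store the `ETag` or `Last-Modified` value from the last full response next to the cached file and pass it in on the next request. This sketch does not capture those response headers itself.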

Nick Andriopoulos
  • The best answer so far. Please also mention ETags and you will have my upvote. – Tadeck Sep 12 '13 at 08:36
  • @Tadeck: though just one method would be enough, but adding another never hurt - added. – Nick Andriopoulos Sep 12 '13 at 08:39
  • i agree, this is the best answer so far. – trrrrrrm Sep 12 '13 at 08:40
  • sadly the remote server doesn't support If-Modified-Since, how can I check if it supports ETags? – Fez Vrasta Sep 12 '13 at 09:04
  • @FezVrasta: from the updated headers above, it supports neither. Assuming you don't have control over the server, your best chance is to use Ruben Serrate Pardo's answer above. If you do, you could enable them. – Nick Andriopoulos Sep 12 '13 at 10:11
  • But I don't have a `Last-Modified` in the headers... what about the `Content-Length`? – Fez Vrasta Sep 12 '13 at 11:08
  • @FezVrasta: you can use it, but it's not as dependable. That's basically how many bytes the page is. Assuming most changes will change the page size, it is an indication, but not 100%. – Nick Andriopoulos Sep 12 '13 at 11:22
  • I think it can be enough to make the life of my server easier – Fez Vrasta Sep 12 '13 at 11:27
  • @hexblot: In case of `Last-Modified` vs `ETag` it really is not enough, as the server can support any of them, both or none, so giving half the solution slashes chances by significant amount. – Tadeck Sep 12 '13 at 13:42
  • @FezVrasta: You can cache the page, if the server (target server) will give you that ability. In other cases, you cannot cache it utilizing HTTP features and you will need to resort to some intermediary part that will fetch it for you and then will allow caching. Other options are just guesses, but you can issue `HEAD` request and try to identify if the site changed. – Tadeck Sep 12 '13 at 13:45
3

You could use cURL to retrieve the headers and either reload the file or serve your cached version, depending on the value of the

Last-Modified: Fri, 14 Sep 2012 21:51:00 GMT

header
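A minimal sketch of that header check, assuming the remote server sends `Last-Modified` at all (the headers shown in the question do not). It uses `CURLOPT_NOBODY` so that only the headers are requested. The function names `parseHeaders`, `fetchHeadersOnly` and `remoteIsNewer` are invented for this example.

```php
<?php
// Parse a raw header block into a lowercase-keyed array.
// Status lines such as "HTTP/1.1 200 OK" contain no colon and are skipped.
function parseHeaders(string $raw): array
{
    $headers = [];
    foreach (preg_split('/\r?\n/', $raw) as $line) {
        $parts = explode(':', $line, 2);
        if (count($parts) === 2) {
            $headers[strtolower(trim($parts[0]))] = trim($parts[1]);
        }
    }
    return $headers;
}

// Issue a HEAD request (no body is downloaded) and return the parsed headers.
function fetchHeadersOnly(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true,  // HEAD request: headers only
        CURLOPT_HEADER         => true,  // include headers in the returned string
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,  // after redirects the last response wins
    ]);
    $raw = curl_exec($ch);
    if ($raw === false) {
        throw new RuntimeException('HEAD request failed: ' . curl_error($ch));
    }
    return parseHeaders($raw);
}

// True when the cache should be refreshed. Without a usable Last-Modified
// header we cannot tell, so we treat the page as changed.
function remoteIsNewer(array $headers, int $cachedAt): bool
{
    if (!isset($headers['last-modified'])) {
        return true;
    }
    $modified = strtotime($headers['last-modified']);
    return $modified === false || $modified > $cachedAt;
}
```

Typical use would be `remoteIsNewer(fetchHeadersOnly($url), filemtime($cacheFile))`, re-downloading only when it returns true.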

Ruben Serrate
  • does this method need some special configuration on the remote page? Isn't the page "last-modified" each time is generated by ASP.net? – Fez Vrasta Sep 12 '13 at 08:37
  • but if you CURL the URL basically you loaded the page. so it would be the same if you check the last-modified and display the page or checked the last-modified and serve it from the cache – trrrrrrm Sep 12 '13 at 08:39
  • 2
    @ra_htial you can use curl_setopt with an option like CURLOPT_NOBODY so that you only request the headers and not the whole page. – Ruben Serrate Sep 12 '13 at 08:44
  • Ya true, just checked http://stackoverflow.com/questions/3834143/does-curlopt-nobody-still-download-the-body-using-bandwidth – trrrrrrm Sep 12 '13 at 08:48
1

I've used the solution by @Prasanth, but it was just a comment and I can't mark it as the answer, so I'm writing it here.
If he wants to write the answer here, I'll mark it as the solution.

For caching on your server see [this](http://stackoverflow.com/a/5263017/1273830). For knowing when the source changed, you'd have to find something unique about the server. Like does the server have content-length header? If so, you can know the page is refreshed if this value changed. But if the site is not totally in your control, or you have no way of knowing exactly when the page changed, you would want to refresh the file cached on your server every now and then, probably using a cron job. Edit: also check out if the server has Last-Modified header as Ruben said.

So checking the content-length does the trick.
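Roughly, the check can look like the sketch below. The names `getCachedPage` and `remoteContentLength` and the `.len` side file are assumptions made for this example, not part of any library. It also assumes the remote server returns `Content-Length` for a HEAD request. As noted in the comments on another answer, an edit that leaves the byte count unchanged will go unnoticed.

```php
<?php
// Return the page, re-downloading only when the remote Content-Length differs
// from the length recorded for the cached copy. The two callables are
// injectable so the logic can be exercised without network access.
function getCachedPage(string $url, string $cacheFile, ?callable $remoteLength = null, ?callable $download = null): string
{
    $remoteLength = $remoteLength ?? 'remoteContentLength';
    $download     = $download ?? 'file_get_contents';

    $metaFile = $cacheFile . '.len';   // hypothetical side file holding the last length
    $length   = $remoteLength($url);
    $known    = is_file($metaFile) ? (int) file_get_contents($metaFile) : null;

    // Same length as last time and a cached copy exists: serve the cache.
    if ($length !== null && $length === $known && is_file($cacheFile)) {
        return file_get_contents($cacheFile);
    }

    $body = $download($url);
    if ($body === false) {
        // Download failed: fall back to a stale copy if there is one.
        if (is_file($cacheFile)) {
            return file_get_contents($cacheFile);
        }
        throw new RuntimeException("Could not download $url");
    }

    file_put_contents($cacheFile, $body, LOCK_EX);
    if ($length !== null) {
        file_put_contents($metaFile, (string) $length, LOCK_EX);
    }
    return $body;
}

// Ask for the headers only (HEAD) and return Content-Length, or null if absent.
function remoteContentLength(string $url): ?int
{
    $context = stream_context_create(['http' => ['method' => 'HEAD']]);
    $headers = @get_headers($url, true, $context);
    if ($headers === false) {
        return null;
    }
    $headers = array_change_key_case($headers, CASE_LOWER);
    if (!isset($headers['content-length'])) {
        return null;
    }
    $value = $headers['content-length'];
    if (is_array($value)) {            // one entry per response when redirected
        $value = end($value);
    }
    return (int) $value;
}
```

With this in place, the original `echo file_get_contents("http://example.com");` would become something like `echo getCachedPage("http://example.com", __DIR__ . "/cache/page.html");`, where the cache path is only an example.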

Fez Vrasta
-4

int filemtime ( string $filename ) returns the last modification time of a file. If it is after the time you cached the page, reload the page; otherwise serve it from the cache.
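One caveat: PHP's `http://` stream wrapper does not support `stat()`, so `filemtime()` cannot report when a remote page changed. It can still drive a simple time-based expiry on the local cache file, as suggested in the comments on the question. A minimal sketch, where the function name `isCacheFresh` and the one-hour lifetime in the usage note are assumptions for illustration:

```php
<?php
// True when the local cache file exists and is younger than $maxAge seconds.
// $now can be injected for testing; it defaults to the current time.
function isCacheFresh(string $cacheFile, int $maxAge, ?int $now = null): bool
{
    $now = $now ?? time();
    clearstatcache(true, $cacheFile);   // avoid a stale cached stat result
    if (!is_file($cacheFile)) {
        return false;
    }
    $mtime = filemtime($cacheFile);
    return $mtime !== false && ($now - $mtime) < $maxAge;
}
```

A page could then serve the cached file while `isCacheFresh($cacheFile, 3600)` is true and re-download it otherwise.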

Silwerclaw