50

I’ve been reading a book and I have a particular question about the ETag chapter. The author says that ETags might harm performance and that you must tune them finely or disable them completely.

I already know what ETags are and understand the risks, but is it that hard to get ETags right?

I’ve just made an application that sends an ETag whose value is the MD5 hash of the response body. This is a simple solution, easy to achieve in many languages.

  • Is using an MD5 hash of the response body as the ETag wrong? If so, why?

  • Why does the author (who obviously outsmarts me by many orders of magnitude) not propose such a simple solution?

This last question is hard to answer unless you are the author :), so I’m trying to find the weak points of using an MD5 hash as an ETag.

Palec
Pablo Fernandez
  • 10
    It's never safe to assume that an author outsmarts you one little bit. – spender Feb 18 '10 at 00:32
  • 23
    @spender: Agreed, but it is even less safe to assume you outsmart the author. – Oddthinking Feb 18 '10 at 00:35
  • 16
    and we won't even touch the smarts of web commentators, one way or the other ;-) – Will Hartung Feb 18 '10 at 00:44
  • 2
    You made an "MD5 hash of the response body", which means your server had to generate the same response again to calculate the hash. If so, you saved transmission of data but not server load. – Kambiz Jul 24 '13 at 15:08
  • @Kambiz Even server load can be saved if the value gets cached after the first computation, you know which events may invalidate it, and you can subscribe to all such events. This may get complicated... – maaartinus Jun 04 '17 at 01:57

4 Answers

55

ETag is similar to the Last-Modified header. It's a mechanism that lets the client determine whether a resource has changed.

An ETag needs to be a unique value representing the state and specific format of a resource (a resource could have multiple formats that each need their own ETag). Not unique across the entire domain of resources, simply within the resource.
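
One way to give each format of a resource its own ETag is to mix the attributes that distinguish the format into the hashed value. This is only a minimal sketch, assuming the format is identified by its content type and content encoding; `representation_etag` is a hypothetical helper, not part of any library.

```python
import hashlib


def representation_etag(source: bytes, content_type: str,
                        content_encoding: str = "identity") -> str:
    """Return a quoted ETag that differs per format of the same resource.

    Hypothetical helper for illustration: `source` is the resource state,
    and the two header values identify the format being served.
    """
    digest = hashlib.md5()
    digest.update(content_type.encode("utf-8"))
    digest.update(b"\0")  # separator so adjacent fields cannot run together
    digest.update(content_encoding.encode("utf-8"))
    digest.update(b"\0")
    digest.update(source)
    return '"' + digest.hexdigest() + '"'


page = b"<h1>Hello</h1>"
plain_etag = representation_etag(page, "text/html")
gzip_etag = representation_etag(page, "text/html", "gzip")
print(plain_etag != gzip_etag)  # same state, different formats
```

The value only has to be unique within the one resource, so nothing about the URL needs to go into the hash.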

Now, technically, an ETag has "infinite" resolution compared to a Last-Modified header. Last-Modified only changes at a granularity of 1 second, whereas an ETag can distinguish sub-second changes.
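
The resolution difference can be seen in a small sketch. Assuming two versions of a resource are saved within the same second, the HTTP date used for Last-Modified comes out identical for both, while a content hash tells them apart. The timestamps and bodies below are made up for illustration.

```python
import hashlib
from email.utils import formatdate

# Two hypothetical versions of a resource, saved 0.4 seconds apart.
first_saved_at = 1266451200.1
second_saved_at = 1266451200.5
first_body = b"version one"
second_body = b"version two"

# HTTP dates carry whole seconds only, so both versions get the same value.
first_last_modified = formatdate(first_saved_at, usegmt=True)
second_last_modified = formatdate(second_saved_at, usegmt=True)

# A hash of the content changes whenever the content does.
first_etag = '"' + hashlib.md5(first_body).hexdigest() + '"'
second_etag = '"' + hashlib.md5(second_body).hexdigest() + '"'

print(first_last_modified == second_last_modified)  # the dates collide
print(first_etag == second_etag)                    # the ETags do not
```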

You can implement both ETag and Last-Modified, or simply one or the other (or none, of course). If Last-Modified is not sufficient, then consider an ETag.

Mind, I would not set ETag for "every" resource. Basically, I wouldn't set it for anything that has no expectation of being cached (dynamic content notably). There's no point in that case, just wasted work.

Edit: I see your edit, so let me clarify.

MD5 is fine. The only downside is calculating MD5 all the time. Running MD5 on, say, a 200K PDF file is expensive. Running MD5 on a resource that has no expectation of being cached (e.g. dynamic content) is simply wasteful.

The trick is simply that whatever mechanism you use, it should be as cheap as Last-Modified typically is. Last-Modified is, again, typically, a property of the resource, and usually very cheap to access.

ETags should be similarly cheap. If you are using MD5, and you can cache/store the association between the resource and the MD5 hash, then that's a fine solution. However, recalculating the MD5 each time the ETag is needed basically runs counter to the idea of using ETags to improve overall server performance.

Will Hartung
  • 1
    Thanks. In my particular case I already have the MD5 because I'm digitally signing the requests, but I see this might be a performance problem for other scenarios. Thanks! – Pablo Fernandez Feb 18 '10 at 01:54
  • 19
    "an ETag that JUST HAPPENS to be the Last Modified date (i.e. the same text) meets all the criteria necessary for an ETag". This isn't true because both a Gzipped and unGzipped response would have the same modified date; they should however have different Etags: https://issues.apache.org/bugzilla/show_bug.cgi?id=39727 – johnstok Feb 11 '11 at 13:57
  • 17
    "Running MD5 on, say, a 200K PDF file, is expensive" - usually not, on modern processors MD5 is many times faster than even Gbit Ethernet, let alone end-user Internet connections. If your ETags have a chance of avoiding even 1% of transfers, it probably makes up for the CPU time used. – intgr Feb 18 '14 at 16:03
  • @Will I edited your answer to make it a bit more factually correct, please feel free to further edit if you believe this is not yet optimal. – Félix Adriyel Gagnon-Grenier May 27 '21 at 18:09
9

We're using etags for our dynamic content in instela.

Our strategy is to generate the MD5 hash of the content at the end of output generation. If the If-None-Match header exists, we compare it with the generated hash. If the two values are the same, we send a 304 status code and end the request without returning any content.
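
In rough terms, that strategy could be sketched as follows. This is not the actual instela code, just an illustration; `conditional_response` is a hypothetical helper, and the header parsing is simplified (it assumes ETag values contain no commas, which holds for hex digests).

```python
import hashlib
from typing import Dict, Optional, Tuple


def conditional_response(body: bytes,
                         if_none_match: Optional[str]) -> Tuple[int, Dict[str, str], bytes]:
    """Return (status, headers, payload) for an already generated body.

    Hypothetical helper: `if_none_match` is the raw If-None-Match request
    header, or None when the client did not send one.
    """
    etag = '"' + hashlib.md5(body).hexdigest() + '"'
    headers = {"ETag": etag}

    if if_none_match is not None:
        # The header may list several ETags, be "*", or carry a weak "W/" prefix.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        candidates = [tag[2:] if tag.startswith("W/") else tag for tag in candidates]
        if "*" in candidates or etag in candidates:
            return 304, headers, b""  # client copy is current: send no body

    return 200, headers, body


# First visit: no validator yet, so the full body is sent along with an ETag.
status, headers, payload = conditional_response(b"newsfeed v1", None)
# Refresh with unchanged content: the client echoes the ETag back.
refresh_status, _, refresh_payload = conditional_response(b"newsfeed v1", headers["ETag"])
print(status, refresh_status, len(refresh_payload))
```

The body is still generated and hashed on every request here, so the saving is in bandwidth rather than in server work.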

It's true that we spend a bit of CPU hashing the content, but in the end we save a lot of bandwidth.

We have a Facebook newsfeed-style main page which has different content for every user. As the newsfeed content changes only 3-4 times per hour, main page refreshes are very efficient on the client side. In the mobile era I think it's better to spend a bit more CPU time than to spend bandwidth. Bandwidth is still more expensive than CPU, and it's a better experience for the client.

Cagatay Gurturk
2

Having not read the book, I can't speak on the author's precise concerns.

However, the generation of ETags should be such that an ETag is only generated once when a page has changed. Generating an MD5 hash of a web page costs processing power and time on the server; if you have many clients connecting, it could start to cause performance problems.

Thus, you need a good technique for generating ETags only when necessary and caching them on the server until the related page changes.
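
One possible shape for this is to compute the hash at the moment a page is saved, so that serving a request never hashes anything. The sketch below assumes all changes go through a single write path; `PageStore` is a hypothetical in-memory class used only for illustration.

```python
import hashlib


class PageStore:
    """Hypothetical in-memory store that hashes a page once, when it is saved."""

    def __init__(self):
        self._pages = {}     # page name -> (body, etag)
        self.hash_count = 0  # how many times MD5 actually ran

    def save(self, name: str, body: bytes) -> None:
        # The only place the hash is computed: when the page changes.
        etag = '"' + hashlib.md5(body).hexdigest() + '"'
        self.hash_count += 1
        self._pages[name] = (body, etag)

    def etag(self, name: str) -> str:
        # Serving a request is a dictionary lookup, with no hashing.
        return self._pages[name][1]

    def body(self, name: str) -> bytes:
        return self._pages[name][0]


store = PageStore()
store.save("home", b"<h1>Hello</h1>")
for _ in range(1000):  # a thousand requests for the same unchanged page
    store.etag("home")
print(store.hash_count)  # prints 1
```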

Dancrumb
1

I think the perceived problem with ETags is probably that your browser has to issue and parse a (simple and small) request / response for every resource on your page to check if the ETag value has changed server side.

I personally find these extra small roundtrips to the server acceptable for often changing images, css, javascript (the server does not need to resend the content if the browser's etag is current) since the mechanism makes it quite easy to mark 'updated' content.

ChristopheD
  • The problem mentioned in the book is that you need to come up with a special and maybe __smart__ strategy (the author even encourages dropping support for ETags if you cannot find a good strategy). That's what I find weird: is MD5 a good solution? If so, why not just say that? – Pablo Fernandez Feb 18 '10 at 00:37
  • A proper `max-age` or `Expires` would let the client know how much to wait without sending even that tiny "is there anything new?" request. So you can save the roundtrips too. – Nicolás Feb 18 '10 at 00:38
  • @Pablo Fernandez: MD5 is fine, but I personally would not hash the entire contents of the file. Hashing the 'last file modification date' should prove enough. About the `why not just say that?` bit: the answer is probably right in the book title (High performance web sites). Etags (and their roundtrips) do add some overhead and could be an important factor to consider on a heavily loaded webserver (but at the same time they add flexibility)... – ChristopheD Feb 18 '10 at 00:42
  • 1
    @Nicolás: true, but `max-age` or `expires` can't make any guarantees for you that the client is always (!) receiving the most up-to-date content. – ChristopheD Feb 18 '10 at 00:44
  • 1
    Hashing the modification date would be useless. If you're going to do that, you might as well drop ETags and let the client use Last-Modified + If-Modified-Since. The whole point of ETags is that they have better than 1-second resolution, and can go "back" to an ETag sent previously. – Nicolás Feb 18 '10 at 00:45
  • @Nicolás: very true (point taken). The last-modified / if-modified-since combo would behave nearly identically to an etag signifying a last-changed-timestamp (and they are probably a better fit for this job ;-). – ChristopheD Feb 18 '10 at 00:52
  • @Nicolás I think it can sometimes be useful for Cache-Control: no-cache. Going back to a previous version does not seem like a common case to me at all. If the resource always changes, then returning a timestamp works fine. – Mo'in Creemers May 01 '23 at 11:16
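
The trade-off discussed in these comments comes down to which `Cache-Control` value accompanies the ETag. A minimal sketch, assuming only two policies are of interest; `caching_headers` is a hypothetical helper.

```python
from typing import Dict, Optional


def caching_headers(etag: str, max_age: Optional[int] = None) -> Dict[str, str]:
    """Build response headers for one of the two policies discussed above.

    Hypothetical helper: pass max_age (seconds) to let clients skip the
    round trip while the copy is fresh, or omit it to force revalidation.
    """
    if max_age is None:
        # no-cache: the copy may be stored but must be revalidated before reuse.
        cache_control = "no-cache"
    else:
        # max-age: the copy may be reused without contacting the server.
        cache_control = "max-age=%d" % max_age
    return {"ETag": etag, "Cache-Control": cache_control}
```

With `max_age` set, the client saves the round trip but may show content up to that many seconds old; without it, every reuse costs one small conditional request.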