Computing an ETag for a REST API

Question

We're building REST APIs in which we use ETags for two purposes:

1. Save bandwidth by allowing the client to avoid reloading a resource (not that important to us)
2. Address concurrency issues (lost update problem)

From a practical perspective, I'm wondering what to use to compute the ETag.

Item hash

We're using a hash of the (json dump of the) item object sent in the response. This works fine. It is easy to check on a PUT request: pull the item from DB, compute hash, compare. However, it makes the separation of concerns a bit "leaky": the layer that builds the response from the item is sort of interleaved with the layer responsible for ETag computation. Besides, additional data (response headers) may matter and if they do, sending a 304 just because the item itself didn't change while headers did might not be appropriate.
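
To make the item-hash approach concrete, here is a minimal sketch of what is described above. The names `item_etag` and `put_is_allowed` are made up for illustration, and SHA-1 over a key-sorted JSON dump is just one possible choice, not necessarily what the real code uses.

```python
import hashlib
import json


def item_etag(item: dict) -> str:
    """Return a quoted entity-tag derived from the item's JSON dump."""
    # sort_keys and fixed separators keep the dump stable, so equal items
    # always produce the same hash regardless of key order.
    dump = json.dumps(item, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(dump.encode("utf-8")).hexdigest()
    return f'"{digest}"'


def put_is_allowed(if_match: str, stored_item: dict) -> bool:
    """On PUT: pull the item from the DB, compute its hash, compare."""
    return if_match == item_etag(stored_item)


stored = {"id": 1, "name": "widget"}
tag = item_etag(stored)  # sent to the client along with the GET response
assert put_is_allowed(tag, {"name": "widget", "id": 1})      # item unchanged
assert not put_is_allowed(tag, {"id": 1, "name": "gadget"})  # item changed meanwhile
```

Only the item goes into the hash here, which is exactly why anything else in the response (headers, extra metadata) cannot influence the ETag.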

Response hash

Another approach would be to just hash the response before sending it. Doing this makes the ETag layer much cleaner for the computation part. However, on a PUT request, we can't just pull the item from DB to check the ETag as we don't have the extra data.
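
A rough sketch of the response-hash variant, for comparison. The function name `response_etag` and the choice of which headers count as relevant are assumptions for illustration only.

```python
import hashlib


def response_etag(body: bytes, headers: dict,
                  relevant=("Content-Type", "Content-Language")) -> str:
    """Hash the serialized response body plus a few selected headers."""
    h = hashlib.sha1()
    # The header list is an assumption; include whatever the client's
    # cached copy actually depends on.
    for name in relevant:
        h.update(f"{name}:{headers.get(name, '')}\n".encode("utf-8"))
    h.update(body)
    return f'"{h.hexdigest()}"'


body = b'{"id": 1, "name": "widget"}'
json_tag = response_etag(body, {"Content-Type": "application/json"})
text_tag = response_etag(body, {"Content-Type": "text/plain"})
assert json_tag != text_tag  # same item, different response, different ETag
```

The ETag layer only needs the finished response, but that is also the catch: on a PUT there is no finished response to hash unless one is built first.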

The first approach (compute item hash) seems appropriate for case 2 (concurrency issues). The second approach (compute payload hash, including metadata and headers) would be appropriate for case 1 (saving bandwidth).

Putting every bit of the response (including headers) in the ETag seems right, as every change there may be relevant and require the client to refresh its cache. But I don't know how to manage concurrency on PUT or DELETE requests with such an ETag.

From a practical perspective, should we use item hash or response hash and how can we handle both cases with one of them?

score 2 · Accepted Answer · answered Jan 24 '22 at 20:38

Given your description I think the response hash is the only one that makes sense here.

First, in order to use conditional requests to avoid the lost update problem, the validators need to be strong.

RFC 7232, Section 3.1 says: "An origin server MUST use the strong comparison function when comparing entity-tags for If-Match (Section 2.3.2), since the client intends this precondition to prevent the method from being applied if there have been any changes to the representation data."

Strong validators can only have the same value when the representations are bit-for-bit identical. But if, as you say, "additional data may matter" beyond the item hash, then you are not in a position to decide on a strong ETag at that time. So you simply could not do an item hash and be consistent with the specification in that case.
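
For reference, the strong comparison function from RFC 7232 Section 2.3.2 is small enough to sketch. The helper names below are hypothetical, and the regex-based header parsing is a simplification rather than a full implementation of the grammar.

```python
import re

# Matches one entity-tag, optionally prefixed with the weak indicator W/.
_ENTITY_TAG = re.compile(r'(?:W/)?"[^"]*"')


def strong_equal(a: str, b: str) -> bool:
    """Strong comparison: neither tag is weak and they match exactly."""
    return not a.startswith("W/") and not b.startswith("W/") and a == b


def if_match_passes(header_value: str, current_etag) -> bool:
    """Evaluate an If-Match header against the current representation's ETag."""
    if current_etag is None:  # no current representation: condition is false
        return False
    if header_value.strip() == "*":
        return True
    return any(strong_equal(tag, current_etag)
               for tag in _ENTITY_TAG.findall(header_value))


assert if_match_passes('"abc"', '"abc"')
assert not if_match_passes('W/"abc"', '"abc"')  # a weak tag never passes If-Match
```

Note that the comparison itself is trivial. The hard part is making sure the server only ever hands out an identical strong tag for identical representations.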

Of course, you could decide that additional data does not matter, in which case you could still do the item hash and be consistent with the specification. But that obviates the one downside you gave for the response hash idea ("we can't just pull the item from DB to check the ETag as we don't have the extra data").

Put differently: you need a strong ETag to avoid lost updates, and strong validators must change "whenever a change occurs to the representation data that would be observable in the payload body of a 200 (OK) response to GET." So to construct the ETag you have to know everything you would know to respond to a GET in any case, so there's no downside to doing it in the response layer.
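
As a sketch of what "doing it in the response layer" could look like: the PUT handler below builds the current representation with the same function GET uses, hashes it, and only then applies the update. Everything here (`DB`, `render`, `etag_of`, `get_item`, `put_item`, the tuple return shape) is hypothetical and framework-free, and the single-tag equality check stands in for proper `If-Match` parsing.

```python
import hashlib
import json

DB = {1: {"id": 1, "name": "widget"}}  # hypothetical in-memory store


def render(item: dict) -> bytes:
    """Build the payload body exactly as it is sent in a 200 response to GET."""
    doc = {"data": item, "links": {"self": f"/items/{item['id']}"}}
    return json.dumps(doc, sort_keys=True).encode("utf-8")


def etag_of(body: bytes) -> str:
    return '"' + hashlib.sha1(body).hexdigest() + '"'


def get_item(item_id):
    body = render(DB[item_id])
    return 200, {"ETag": etag_of(body)}, body


def put_item(item_id, new_item, if_match):
    if if_match is None:
        return 428, {}, b""      # Precondition Required
    current = DB[item_id]        # single lookup, reused for the check
    if if_match != etag_of(render(current)):
        return 412, {}, b""      # Precondition Failed: someone else updated first
    DB[item_id] = new_item
    body = render(new_item)
    return 200, {"ETag": etag_of(body)}, body


_, headers, _ = get_item(1)
first = put_item(1, {"id": 1, "name": "gadget"}, headers["ETag"])
second = put_item(1, {"id": 1, "name": "gizmo"}, headers["ETag"])  # stale tag
assert (first[0], second[0]) == (200, 412)
```

The item is fetched once and passed through `render`, so the precondition check costs one extra serialization rather than a second DB round trip.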

I guess you're right and the issue I have with this is mostly related to my implementation and framework. I shall ask a more specific question centered on the implementation details that bug me. I suppose I used shortcuts that were practical, and now doing things right looks like a regression (in terms of code architecture and performance). I need to think this through, but thanks for this already. — Jérôme, Feb 10 '22 at 22:45
E.g. I could just call my GET view function to compute the ETag before executing PUT or DELETE view functions. But this means pulling the entity from the DB twice (once for the GET and once for the PUT), unless I manage to share the DB lookup between them, and I don't see how to do that neatly. Looks like I'd need to rethink the way I design my view functions. For the record, here's how I do it right now: https://flask-smorest.readthedocs.io/en/latest/etag.html#etag-computed-with-api-response-data. — Jérôme, Feb 10 '22 at 22:51
@Jérôme: Yes, implicit in that library design is the idea that the `ETag` is based on the item data. And it's perfectly fine to do it that way, you just have to accept that you won't be able to add "additional data" to your responses. Note that `ETags` do not need to be, and ideally wouldn't be, hashes of arbitrary data. A more canonical example would be a version identifier stored in the database, in which case it would of course make perfect sense to base the `ETag` on the raw data. So it's up to you how you do it, you just have to be consistent with the specification. — Kevin Christopher Henry, Feb 11 '22 at 03:18
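
A minimal sketch of the version-identifier idea mentioned in the last comment, using `sqlite3` from the standard library. The table layout and the helper names `current_etag` and `update_if_match` are assumptions for illustration. It only yields a valid strong ETag if the representation depends on nothing but the row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO items VALUES (1, 'widget', 1)")


def current_etag(item_id: int):
    """Use the stored version number as the entity-tag."""
    row = conn.execute(
        "SELECT version FROM items WHERE id = ?", (item_id,)
    ).fetchone()
    return None if row is None else f'"{row[0]}"'


def update_if_match(item_id: int, new_name: str, if_match: str) -> bool:
    """Apply the update only if the stored version still equals the client's tag."""
    try:
        expected = int(if_match.strip('"'))
    except ValueError:
        return False
    # The version check and the increment happen in one statement, so two
    # concurrent writers cannot both succeed with the same tag.
    cur = conn.execute(
        "UPDATE items SET name = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_name, item_id, expected),
    )
    conn.commit()
    return cur.rowcount == 1


tag = current_etag(1)
assert update_if_match(1, "gadget", tag)     # first writer wins
assert not update_if_match(1, "gizmo", tag)  # second writer holds a stale tag
```

With this design nothing needs to be serialized or hashed to evaluate the precondition, which sidesteps the double-lookup concern raised in the comments.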
