HTTP Spec: PUT without data transfer, since hash of data is known to server

Question

Does the HTTP/WebDav spec allow this client-server dialog?

client: I want to PUT data to /user1/foo.mkv which has this hash sum: HASH
server: OK, PUT was successful, you don't need to send the data since I already know the data with this hash sum.

Note: This PUT is an initial upload. It is not an update.

If this is possible, much faster file syncing could be implemented.

Use case: The WebDAV server hosts a directory for each user. The favorite video foo.mkv gets uploaded by several users. In this example the favorite video is already stored at this location: /user2/myfoo.mkv. The second and following uploads don't need to send any data, since the server already knows the content. This would reduce a lot of network load.

Preconditions:

Client and server would need to agree on the hash algorithm beforehand.
The server needs to store the hash-value of already known files.

It would be very easy to implement this in a custom client and server. But that's not what I want.

My question: Is there an RFC or other standard that allows such a dialog?

If there is no standard yet, how should I proceed to make this dream come true?

Security consideration

With the above dialog, a client could gain access to the content behind known hashes. Example: an evil client knows that there is a file with the hash sum 1234567.... It could perform the two steps above and then use a GET to download the data.

A way around this is to extend the dialog:

client: I want to PUT data which has this hash sum: HASH
server: OK, PUT would be successful, but to be sure that you have the data, please send me the bytes N up to M. I need this to be sure you have the hash-sum and the data.
client: Bytes N up to M of the data are abcde...
server: OK, your bytes match mine. I trust you. Upload successful, you don't need to send the data any more.
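
To illustrate, the extended dialog might look roughly like this in Python. This is only a sketch of the idea, not part of any spec: the function names `make_challenge`, `answer_challenge` and `verify_challenge` and the 64-byte range length are assumptions made up for the example.

```python
import secrets

def make_challenge(size, length=64):
    """Server side: pick a random byte range [n, m) inside the stored file."""
    length = min(length, size)
    n = secrets.randbelow(size - length + 1)
    return n, n + length

def answer_challenge(data, n, m):
    """Client side: return the requested bytes of the local copy."""
    return data[n:m]

def verify_challenge(stored, n, m, answer):
    """Server side: compare the client's bytes with the stored copy."""
    return secrets.compare_digest(stored[n:m], answer)
```

Because the range is chosen at random for every request, knowing only the hash is not enough to pass the check.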

How to get this done?

Since it seems that there is no spec yet, this part of the question remains:

How should I proceed to make this dream come true?

It's an interesting idea, but it opens a can of worms with respect to access rights (knowing the hash must not allow people to see/get content they have no access to). — Julian Reschke, Sep 28 '15 at 10:48
@JulianReschke good point. But if I know the hash. Then I was able to compute the hash from the data. Or someone else was able to compute the hash and gave me the hash. If someone is able to compute the hash, then he has the data. You can't get unauthorized access to a specific data. But you could write a brute force script which tries random hash values... and maybe the script is lucky and gets some random data that the server already knows. — guettli, Sep 28 '15 at 12:10
Sounds like [ETag](https://en.wikipedia.org/wiki/HTTP_ETag)s... — user193130, Sep 29 '15 at 21:29
@user193130 I know ETags only for http-GET. Do they work for PUT? — guettli, Sep 30 '15 at 04:38
Regarding the security consideration, if the evil client knows the hash, how can the data leak out? In the original dialog that you envisioned, the server only responds that it has a file with the hash, but it doesn't actually return that file. — user193130, Sep 30 '15 at 23:22
@user193130 after the PUT without data transfer, the client could do a GET and get the data. Since the file was uploaded by the client, the client is allowed to read it. This would be a possible data leak. — guettli, Oct 01 '15 at 11:12
Still not sure what you mean -- if you want the server to respond to a PUT request with a specific hash, saying that the file with that hash already exists anywhere on the server (my understanding from your question), then the client wouldn't know where that file exists on the server and would not be able to GET the resource. And if the file was originally uploaded by the client, then reading it wouldn't make a difference since the client already has the file. — user193130, Oct 01 '15 at 19:55
@user193130 Step one is this: "client: I want to PUT data to /user1/foo.mkv which has this hash sum: HASH". The client provides the specific hash, not the server. The client sends a path with the PUT request. The client knows where it will be stored. Where the server stores the file physically does not matter. — guettli, Oct 02 '15 at 08:14
@guettli I didn't fully understand your question. Are you willing to modify the client and server code to get this done or are you just looking for something that already works? — svsd, Oct 03 '15 at 18:11
@skmrx I was unsure whether there is already a spec which allows upload without data transfer. Now, after asking the question here, I am more sure (though not 100% yet). I see a lot of useless uploads. In my case I have an ownCloud WebDAV server and the FolderSync app on Android. Both are from different vendors. I don't like proprietary solutions like Dropbox. I would love to see a spec and several different servers and clients. — guettli, Oct 04 '15 at 07:15
I've updated my answer based on your updated question. Also, I still don't understand what you mentioned as a security problem -- please see consideration #4 in the answer. — user193130, Oct 05 '15 at 16:39
@guettli BTW, just wondering what your opinion of ownCloud is so far? I thought about using ownCloud as well but haven't gotten around to it yet. The idea that Dropbox controls my data and everything is opaque also puts me off :( — user193130, Oct 05 '15 at 17:03
@user193130 I use owncloud, but there are some major issues (my point of view). Here are two: They don't support ETags. I have seen useless downloads. Second: The android app is not able to store the data on the external sd-card. — guettli, Oct 06 '15 at 06:18
@guettli Ah I see thanks for the info. I hope my updated answer helps you if you decide to implement it yourself. — user193130, Oct 07 '15 at 01:58
@user193130 thank you very much. Maybe my goal is too high .... but I don't want to implement this. I want an official spec :-) — guettli, Oct 07 '15 at 05:54

score 7 · Accepted Answer · edited May 23 '17 at 12:14

From what you described, it seems like ETags should be used.

They were specifically designed to associate a tag (often a hash such as MD5, but it can be anything) with a resource's content (and/or location) so you can later tell whether the resource has changed or not.

ETags work with PUT requests and are commonly used with the If-Match header for optimistic concurrency control.

However, your use case is slightly different as you are trying to prevent a PUT to a resource with the same content, whereas the If-Match header is used to only allow the PUT to a resource with the same content.

In your case, you can instead use the If-None-Match header:

The meaning of "If-None-Match: *" is that the method MUST NOT be performed if the representation selected by the origin server (or by a cache, possibly using the Vary mechanism, see section 14.44) exists, and SHOULD be performed if the representation does not exist. This feature is intended to be useful in preventing races between PUT operations.

WebDAV also supports ETags, though how they are used may depend on the implementation:

Note that the meaning of an ETag in a PUT response is not clearly defined either in this document or in RFC 2616 (i.e., whether the ETag means that the resource is octet-for-octet equivalent to the body of the PUT request, or whether the server could have made minor changes in the formatting or content of the document upon storage). This is an HTTP issue, not purely a WebDAV issue.

If you are implementing your own client, I would do something like this:

Client sends a HEAD request to the resource to check the ETag
- If the client sees that it matches what it has already, do not send anything else
- If it doesn't match, then send the PUT request with the If-None-Match header
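
As a rough illustration of that flow, here is a minimal Python sketch. The function name `upload_if_changed` is made up for the example, and it assumes the client can compute the same ETag value the server reports (including the surrounding quotes), which only holds if you control both sides.

```python
import http.client  # only needed for the usage shown at the bottom

def upload_if_changed(conn, path, body, local_etag):
    """HEAD first; PUT only when the server's ETag differs from ours.

    `conn` is an http.client.HTTPConnection (or any object with the same
    request/getresponse interface). `local_etag` must already be in the
    quoted form the server uses, e.g. '"abc123"'.
    Returns the final HTTP status of the PUT, or None if it was skipped.
    """
    conn.request("HEAD", path)
    head = conn.getresponse()
    head.read()  # drain the response so the connection can be reused
    if head.status == 200 and head.getheader("ETag") == local_etag:
        return None  # identical content is already stored at this path
    conn.request("PUT", path, body=body,
                 headers={"If-None-Match": local_etag})
    put = conn.getresponse()
    put.read()
    return put.status

# Hypothetical usage (host and path are placeholders):
# conn = http.client.HTTPConnection("dav.example.org")
# upload_if_changed(conn, "/user1/foo.mkv", data, '"abc123"')
```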

UPDATE

From your updated question, it now seems clear that when a PUT request is received, you want to check ALL resources on the server for the absence of the same content before the request is accepted. That means also checking resources which are in a different location than what was specified as the destination to the PUT request.

AFAIK, there's no existing spec to specifically handle this case. However, the ETag mechanism (and the HTTP protocol) was designed to be generic and flexible enough to handle many cases and this is one of them.

Of course, this just means you can't take advantage of standard HTTP server logic -- you'd need to custom code both the client and server side.

Assumptions

Before I get into possible implementations, there are some assumptions that need to be made.

As mentioned, you need to control both the server and the client
An algorithm needs to be agreed upon for generating the ETag based on the content. This can be MD5, SHA1, SHA2-256, SHA3, a concatenation of a combination of them, etc. I'll just mention the algorithm output as the ETag, but how you do it is up to you.
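
For example, a content-based ETag could be computed on the client along these lines. This is just a sketch: the function name `content_etag` and the choice of SHA-256 as the default are assumptions, and the server would have to use exactly the same scheme.

```python
import hashlib

def content_etag(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Hash a file in chunks and return the digest as a quoted ETag value."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large files (e.g. videos) never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return '"' + h.hexdigest() + '"'
```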

Possible implementations

These are ordered from simplest to most complex, in case the simple one doesn't work for you.

Possible implementation 1

This assumes your server implementation allows you to read the request headers and respond before the entire request is received.

Client computes the ETag for the file/resource to upload.
Client sends a PUT request to the server (location doesn't matter) with the header If-None-Match containing the ETag and continues sending the body normally.
Server checks to see if a resource with the ETag already exists.
Server:
- If ETag already exists, immediately return a 412 response code. Optionally terminate the connection to stop the client from continuing to send the resource (NOTE: This is NOT advisable by the HTTP spec, though not explicitly prohibited. See note 1 below). Yes, a little bandwidth is wasted, but you wouldn't have to wait for the entire request to finish.
- If ETag doesn't exist, wait for the request to finish normally.
Client:
- If the 412 response is received, interpret it to mean that the resource already exists and the request needs to be aborted -- stop sending data.

Possible implementation 2

This is slightly more complex, but better adheres to the HTTP spec. Also, this MIGHT work if your server architecture doesn't allow you to read the headers before the entire request is received.

Client computes the ETag for the file/resource to upload.
Client sends a PUT request to the server (location doesn't matter) with the header If-None-Match containing the ETag and an Expect: 100-continue header. The request body is NOT yet sent at this point.
Server checks to see if a resource with the ETag already exists.
Server:
- If ETag already exists, return a 412 response.
- If ETag doesn't exist, send a 100 response and wait for the request to finish normally.
Client:
- If the 412 response is received, interpret it to mean that the resource already exists and the request was therefore aborted.
- If the 100 response is received, continue sending the body normally
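
The server-side decision in the steps above could be sketched as follows. This is a simplified model, not real server code: `respond_to_put_headers` is a made-up name, `known_etags` stands in for whatever lookup the server uses, and header names are assumed to arrive already normalized.

```python
def respond_to_put_headers(headers, known_etags):
    """Decide what to send after reading only the request headers.

    Returns 412 if the content is already known, 100 if the client is
    waiting for permission to send the body, or None if the server
    should simply read the body and respond afterwards.
    """
    etag = headers.get("If-None-Match")
    expects_continue = headers.get("Expect", "").lower() == "100-continue"
    if etag is not None and etag in known_etags:
        return 412  # duplicate content: the body never needs to be sent
    return 100 if expects_continue else None
```

A client that did not send the Expect header gets None here, which corresponds to the first implementation where the body is already on its way.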

Possible implementation 3

This implementation probably requires the most work but should be broadly compatible with all major libraries / architectures. There's a small risk of another client uploading a file with the same contents in between the two requests though.

Client computes the ETag for the file/resource to upload.
Client sends a HEAD request (no body) to the server at /check-etag/<etag> where <etag> is the ETag. This checks whether the ETag already exists at the server.
Server code at /check-etag/* checks to see if a resource with that ETag already exists.
Server:
- If ETag already exists, return a 200 response.
- If ETag doesn't exist, send a 404 response.
Client:
- If the 200 response is received, interpret it to mean that the resource already exists and do not proceed with a PUT request.
- If the 404 response is received, follow up with a normal PUT request to the intended destination.
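
A minimal sketch of both sides of this exchange might look like the following. The helper names are made up for the example, and the ETag is assumed to appear in the URL as a plain hex digest without quotes.

```python
CHECK_PREFIX = "/check-etag/"

def handle_check_etag(path, known_etags):
    """Server side: status code for a HEAD request to /check-etag/<etag>."""
    if not path.startswith(CHECK_PREFIX):
        return 404
    etag = path[len(CHECK_PREFIX):]
    return 200 if etag in known_etags else 404

def client_should_upload(check_status):
    """Client side: follow up with a normal PUT only on a 404."""
    return check_status == 404
```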

Considerations

Although the implementation is up to you, here are some points to consider:

When a resource is added or updated, the ETag and the location should be stored in a database for quick retrieval. It is needlessly inefficient for a server to recompute the hash for every single resource whenever a resource is being uploaded. There should also be an index on the ETag and location fields for quick retrieval.
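
One way to keep such a lookup table is sketched below, using SQLite purely as an example. The table name `resources`, the index name and the function names are all assumptions; any database with an index on the ETag column would do.

```python
import sqlite3

def open_index(db_path=":memory:"):
    """Create the location/ETag table with an index on the ETag column."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS resources ("
        "location TEXT PRIMARY KEY, etag TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_resources_etag ON resources (etag)"
    )
    return conn

def record_upload(conn, location, etag):
    """Store or update the ETag when a resource is added or changed."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO resources (location, etag) VALUES (?, ?)",
            (location, etag),
        )

def find_by_etag(conn, etag):
    """Return one location holding this content, or None if unknown."""
    row = conn.execute(
        "SELECT location FROM resources WHERE etag = ? LIMIT 1", (etag,)
    ).fetchone()
    return row[0] if row else None
```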
If two clients upload a resource with the same ETag at the same time, you might want to abort the 2nd one as soon as the 1st one finishes.
Using hashes for the ETag means that there's a possibility of collision (where two resources would have the same hash), though in practice the possibility is extremely slim if a good hash is used. Note that MD5 is known to be weak against intentional collision attacks. If you are paranoid, you can concatenate multiple hashes to make a collision even less likely.
With regard to your "security consideration", I still don't see how knowing a hash would lead to retrieval of a resource. The server will only and SHOULD ONLY tell you whether a specific ETag exists or not. Without divulging the location, it's not possible for the client to retrieve the file. And even if the client knows the location, the server SHOULD implement other security controls such as authentication and authorization to restrict access. Using the resource location solely as a way of restricting access is just security by obscurity, especially since from what you mentioned, the paths seem to follow a pattern by username.

Notes

RFC 2616 indicates this SHOULD NOT be done:

If an origin server receives a request that does not include an Expect request-header field with the "100-continue" expectation, the request includes a request body, and the server responds with a final status code before reading the entire request body from the transport connection, then the server SHOULD NOT close the transport connection until it has read the entire request, or until the client closes the connection. Otherwise, the client might not reliably receive the response message.

Also, DO NOT close the connection from the server side without sending any status codes, as the client will most likely retry the request:

If an HTTP/1.1 client sends a request which includes a request body, but which does not include an Expect request-header field with the "100-continue" expectation, and if the client is not directly connected to an HTTP/1.1 origin server, and if the client sees the connection close before receiving any status from the server, the client SHOULD retry the request.

Thank you for your answer. But is this really the answer to my question? ETags are there to check if a resource has changed or not. I want deduplication for new uploads (not update). Later you say I should do a HEAD request first. This won't work since it is an initial upload. I will update the question to make this more clear. — guettli, Sep 30 '15 at 18:51
My point was you can use ETags to check whether the destination you want to upload to is exactly the same as what you're about to upload. But from what you're saying, the user/client doesn't know the location? In other words, you're allowing an upload to an arbitrary location but you want to check if there's another exact same file anywhere else on the server? AFAIK, there's no such thing built-in and you would need to custom code it on both the server and client side. — user193130, Sep 30 '15 at 23:14
ETags are supposed to be opaque to clients. How would the client derive the ETag of a file the first time without uploading it to the server? — svsd, Oct 03 '15 at 17:59
@skmrx Yes, you're right in that it's normally *supposed* to be opaque -- but it doesn't have to be. It can be any arbitrary string. See [these](http://stackoverflow.com/a/2285584/2891365) [other](http://stackoverflow.com/a/4540/2891365) answers for generating a custom ETag. As long as the OP can control the client and the server, a custom ETag scheme can be used that lets the client generate the ETag before anything is sent to the server. — user193130, Oct 05 '15 at 14:28