
I am currently developing a synchronization service that fetches all user profile pictures from an Exchange server. To track changes, I compute an MD5 hash of the response body and persist it with the entity in the database, so that I can compare it later and see whether the picture has changed. The picture itself is saved on disk.

The pictures are 504x504 pixels and weigh around 27 KB each. Since I am hashing the response bytes, I still have to download the 27 KB array even if the picture matches, which leads to almost no speed improvement (aside from not having to replace the file on disk). Multiply that by a huge number of users and the job takes 20 minutes even if all the pictures match.

Is there a way to optimize the synchronization so that I do not download the response body if a picture is unchanged? Here is some code to help you understand better:

entity = restTemplate.getForEntity(
                Constant.EXCHANGE_URL_PREFIX + emailAddress + Constant.EXCHANGE_URL_SUFFIX, byte[].class);

This is how I make the GET request.

if (entity.hasBody()) {
 // Hash the downloaded bytes so they can be compared with the stored hashes
 String hexHash = Hex.encodeHexString(MessageDigest.getInstance("MD5").digest(entity.getBody()));
 if (!listofHashes.contains(hexHash)) {
    picture.remove();
 } else {
    picture.save();
 }
}

To sum it up: is there a way of detecting webpage changes using restTemplate that does not download the entire page? Thank you in advance.

Edit: Additional research into the ETag header and the @Cacheable annotation did not prove successful.

cristianhh
  • probably https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified but there is no guarantee the server sends that information along – luk2302 Jun 21 '17 at 09:31
  • Unfortunately, I have compared multiple request headers with each other; aside from the request-id, the other fields are equal – cristianhh Jun 21 '17 at 11:13
  • Did you try to do a `HEAD` request instead of `GET` and check the response headers? Did you try to send the `If-Modified-Since` header in the request? – yinon Jun 27 '17 at 14:01
  • @yinon I tried sending If-Modified-Since, but not a HEAD request; will look into it – cristianhh Jun 28 '17 at 07:41

3 Answers


It really depends on the capabilities of the server you are communicating with. If that server does not support any of the standard mechanisms (ETag, If-Modified-Since, etc., as mentioned in the comments) and does not send any other custom header that identifies the content, then you have no choice but to do what you described: calculate the digest of the response body on the client side (in your application).
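As a rough illustration of that fallback, the client-side digest comparison could look like the sketch below. It uses only the JDK (no Commons Codec), and the class and method names (`PictureDigest`, `md5Hex`, `hasChanged`) are hypothetical, not taken from the question's code.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper; names are illustrative and not from the original code.
final class PictureDigest {

    // Returns the lowercase hex MD5 digest of the given bytes.
    static String md5Hex(byte[] body) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5").digest(body);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    // True when there is no stored hash or the downloaded body hashes differently.
    static boolean hasChanged(byte[] body, String storedHash) {
        return storedHash == null || !storedHash.equalsIgnoreCase(md5Hex(body));
    }
}
```

With this approach the body is still downloaded every time; it only saves the disk write when `hasChanged` returns false.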

yinon

You can try an HTTP HEAD request, which returns only the headers and no body.

Then, from the reply, check "content-length" and "last-modified": if they don't match the image that you have already stored, you have to download it again.

For example, doing that for an image on Wikipedia, I got these:

content-length: 314402

last-modified: Thu, 31 Oct 2013 14:45:43 GMT

Note, about "content-length":

The Content-Length entity-header field indicates the size of the entity-body, in decimal number of OCTETs, sent to the recipient or, in the case of the HEAD method, the size of the entity-body that would have been sent had the request been a GET. (see more here: http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html)

This is the curl command that I used:

$ curl -X HEAD -I "https://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Lifeboat.17-31.underway.arp.jpg/1200px-Lifeboat.17-31.underway.arp.jpg"

HTTP/2 200 
date: Thu, 29 Jun 2017 08:30:29 GMT
content-type: image/jpeg
content-length: 314402
x-object-meta-sha1base36: oboqyviefa9uqy9p7391dxgod784onh
last-modified: Thu, 31 Oct 2013 14:45:43 GMT
etag: 188492bd99a0032624df62205d156bb4
x-timestamp: 1383230742.02258
x-trans-id: tx73ff02723dc5476c92e0a-005953e448
x-varnish: 894182014 897225224, 41759639 11075541, 415722130
via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4
accept-ranges: bytes
age: 54876
x-cache: cp1063 hit/1, cp3045 hit/72, cp3049 pass
x-cache-status: hit
strict-transport-security: max-age=31536000; includeSubDomains; preload
x-analytics: https=1;nocookies=1
x-client-ip: 82.181.132.52
access-control-allow-origin: *
access-control-expose-headers: Age, Date, Content-Length, Content-Range, X-Content-Duration, X-Cache, X-Varnish
timing-allow-origin: *
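
The same check can be sketched in Java. To stay dependency-free, this sketch uses the JDK's `java.net.http.HttpClient` (Java 11+) rather than RestTemplate; with RestTemplate, `headForHeaders(url)` should serve the same purpose. The class and method names (`HeadCheck`, `looksChanged`, `fetchHeaders`) are hypothetical, and it assumes the stored length and Last-Modified string were saved when the picture was last downloaded.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpHeaders;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; names are illustrative and not from the original answer.
final class HeadCheck {

    // True when the server-reported metadata differs from what was stored,
    // or when the server did not send usable metadata at all.
    static boolean looksChanged(HttpHeaders headers, long storedLength, String storedLastModified) {
        long length = headers.firstValueAsLong("Content-Length").orElse(-1L);
        String lastModified = headers.firstValue("Last-Modified").orElse(null);
        if (length < 0 || lastModified == null) {
            return true; // nothing to compare against, so assume a change
        }
        return length != storedLength || !lastModified.equals(storedLastModified);
    }

    // Sends a HEAD request and returns only the response headers.
    static HttpHeaders fetchHeaders(HttpClient client, String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        return client.send(request, HttpResponse.BodyHandlers.discarding()).headers();
    }
}
```

A caller would download the picture with a normal GET only when `looksChanged(fetchHeaders(client, url), storedLength, storedLastModified)` returns true. Note that an unchanged length and date is a heuristic, not proof that the bytes are identical.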
Jose Zevallos

Conditional GET

If your server implements the HTTP/1.1 standard, you can use either of the following header pairs to make a conditional request:

  • Last-Modified/If-Modified-Since
  • ETag/If-None-Match

The server should return 304 (Not Modified) if the ETag matches or the resource has not been modified since the last fetch date.

Examples:

Request Header:

If-Modified-Since: Sat, 06 Aug 2016 05:22:27 GMT
If-None-Match: "02c7fd69fa875302f71b714fa2787cc95fa88245"

Response Header:

Last-Modified: Sat, 04 Apr 2015 09:05:44 GMT
ETag: "02c7fd69fa875302f71b714fa2787cc95fa88245"
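
A conditional GET along these lines might be sketched as follows. This again uses the JDK's `java.net.http.HttpClient` (Java 11+) to keep the example self-contained; with RestTemplate, the same two headers could be set on the request passed to `exchange(...)`. The names `ConditionalGet`, `buildRequest` and `fetchIfChanged` are hypothetical, and the sketch assumes the ETag and Last-Modified values from the previous response were stored next to the picture.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper; names are illustrative and not from the original answer.
final class ConditionalGet {

    // Builds a GET that carries the validators saved from the previous response.
    static HttpRequest buildRequest(String url, String storedEtag, String storedLastModified) {
        HttpRequest.Builder builder = HttpRequest.newBuilder(URI.create(url)).GET();
        if (storedEtag != null) {
            builder.header("If-None-Match", storedEtag);
        }
        if (storedLastModified != null) {
            builder.header("If-Modified-Since", storedLastModified);
        }
        return builder.build();
    }

    // Returns the new picture bytes, or null when the server answers 304 Not Modified.
    static byte[] fetchIfChanged(HttpClient client, HttpRequest request) throws Exception {
        HttpResponse<byte[]> response = client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        if (response.statusCode() == 304) {
            return null; // unchanged: the server sent no body
        }
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Unexpected status " + response.statusCode());
        }
        return response.body();
    }
}
```

This only saves bandwidth if the server actually honours the validators; a server that ignores them simply answers 200 with the full body each time.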

Conclusion

  • What you are doing on the client should really be done by your Exchange server, and that is what the 304 status is for;
  • The Last-Modified approach can be affected by clock drift in a distributed system, while ETag is not;
  • An ETag, on the other hand, may incorporate file-system inode information, so moving the file may also change the ETag value;


Tony